ABSTRACT 


YUFANG  BAO.  Nonlinear  Image  Denoising  Methodologies. 

(Under  the  direction  of  Prof.  Hamid  Krim.) 

In  this  thesis,  we  propose  a  theoretical  as  well  as  practical  framework  to  combine  geo¬ 
metric  prior  information  to  a  statistical/probabilitstic  methodology  in  the  investigation  of  a 
denoising  problem  in  its  generic  form  together  with  its  various  applications  in  signal/image 
analysis. 

We  are  able  in  the  process,  to  investigate,  understand  and  mitigate  existing  limitations 
of  so-called  nonlinear  diffusion  techniques  (  such  as  the  Perona-Malik  equation)  from  a 
probabilistic  view  point,  and  propose  a  new  nonlinear  denoising  method  that  is  based  on  a 
random  walk  whose  transition  probabilities  are  selected  by  the  information  of  a  two-sided 
gradient.  This  results  in  a  piecewise  constant  hltered  image  and  lifts  the  long-standing 
problem  of  an  unknown  evolution  stopping  time. 

Our  second  contribution  is  in  establishing  a  direct  link  between  multi-resolution  analysis 
techniques  and  so-called  scale  space  analysis  methods,  which  we  in  turn  utilize  to  improve 
the  performance  of  segmentation-optimized  image  analysis  techniques.  This  is  accomplished 
by  using  wavelets  of  higher  order  vanishing  moments,  specihcally,  we  achieve  a  reduction  in 
the  typical  ’’blocky”  artifacts  and  a  better  preservation  of  texture  information. 

Our  third  and  hnal  contribution  is  to  propose  a  drastically  different  approach  by  isolating 
statistically  independent  components  in  a  signal,  which  we  later  use  as  a  basis  for  discrim¬ 
ination  against  noise,  or  potentially  as  plain  features.  This  is  related  to  the  well  known 
independent  component  analysis  (  ICA  ),  for  which  we  first  propose  a— Jensen  -Renyi  di¬ 
vergence  as  an  information-  theoretic  criterion.  In  addition,  we  propose  a  Renyi  mutual 
divergence  as  a  better  criterion  to  separate  mixed  signals  along  with  a  non-parametric  esti¬ 
mation  technique  for  such  a  measure  for  1-D  problems. 
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Chapter  1 
Introduction 


One/two  dimensional  signals  characterized  by  singularities  are  usually  contaminated  by  ad¬ 
ditive  noise,  which  cause  difficulties  in  localizing  them.  This  thesis  will  address  this  issue 
as  related  to  problems  of  signal  restoration,  segmentation  and  edge  detection  as  briefly  de¬ 
scribed  in  this  chapter.  Upon  motivating  and  formulating  the  basic  problem,  we  summarize 
our  contributions  in  this  direction  using  a  stochastic  random  walk,  wavelet  frame  theory  and 
information  measure  as  the  basic  analytical  tools. 

1.1  Problem  Motivation  and  Formulation 

The  primary  goal  of  processing  a  noisy  signal  is  to  obtain  a  reconstruction  as  close  to  the 
original  clean  signal  as  possible,  which,  in  turn,  provides  a  rehable(  hopefully,  robust )  version 
for  segmentation,  and  edge  detection  of  a  signal/image.  The  design  of  a  hlter  targeted  for 
denoising  purpose  is  normally  based  on  some  prior  knowledge  about  the  signal,  e.g.,  staircase 
or  smooth  etc.  In  this  thesis.  Our  approach  to  denoising  is  hrst  based  on  a  controlled 
nonlinear  stochastic  random  walk  to  achieve  a  scale  space  analysis(  as  in  Chapter  2,  3)  to 
enhance  images.  To  better  preserve  texture  in  images,  in  Chapter  5,  we  use  wavelet  frames 
to  simultaneously  improve  the  enhancement  as  well  as  the  segmentation.  In  chapter  7,  we 
introduce  two  new  information  theoretical  approaches  to  extract  independent  components 
from  mixed  signals. 

A  classical  method  to  restore  a  useful  signal  is  to  adapt  a  probabilistic  signal  prior  model 
in  applying  a  statistical  methodology,  such  as  Maximum  A  Posteriori(MAP)  estimation, 
see [32,  31].  A  precise  probabilistic  model  is,  however,  usually  unavailable  and  a  wrong 
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model  may  yield  significant  errors  and  is  unacceptable  in  signal  recovery  problem.  An  effort 
of  adopting  a  more  objective  energy  functional  maybe  constructed  to  derive  a  MAP-like 
principle  by  line  of  variational  formulation  approach  and  rooted  in  the  intrinsic  geometry 
and  smoothness  of  the  signal.  The  optimization  of  such  a  functional  by  way  of  the  Euler- 
Lagrange  equation  yields  a  steepest  gradient  descent  search  for  the  optimal  signal/image. 
The  so-obtained  partial  differential  equation  (PDE)  is  an  evolution  of  a  signal/image  through 
scales. 

Scale  space,  first  introduced  as  a  homogenous  dynamic  low-pass  filter  linearly  smooth 
away  noise  with  increasing  of  scale,  was  extended  to  a  nonlinear  selective  smoothing  by 
Perona-Malik  [74,  75,  89].  This  triggered  an  intense  interest  in  searching  for  new  non¬ 
linear  filters  to  better  preserve  features  [73,  78,  93,  87,  60,  94,  15,  58].  While  simple  to 
implement,  these  procedures  become  complex  involved  for  noise  contaminated  images[8,  91]. 
Several  improvements  have  been  proposed  since,  and  for  example,  a  more  flexible  tech¬ 
nique,  which  has  been  shown  to  be  equivalent  to  anisotropic  equation  is  that  of  Mean-field 
annealing(MFA)[40],  for  which,  the  parameter  choice  is  much  simpler. 

Most,  if  not  all,  of  existing  techniques  have  been  predominantly  deterministic  in  nature, 
with  little  or  no  stochastic  treatment  or  interpretation  of  the  diffusion.  In  addition,  unless 
a  specific  stopping  time  is  known  to  be  adequate,  the  resulting  evolution  equation  is  well 
known  to  almost  always  lead  to  a  complete  smoothing  of  the  signals(  i.e.,  the  steady  state  of 
the  PDE).  Poliak  et.  al.  [77,  78]  recently  proposed  an  approach  addressing  robustness  issues, 
and  showed  some  remarkable  results  for  a  wide  class  of  perturbation  noises.  The  analysis 
remained  as  in  all  other  cases,  fundamentally  deterministic,  and  also  required  knowledge  of 
the  stopping  time  for  the  evolution. 

From  a  probabilistic  vantage  point,  the  characteristics  of  the  random  process  which  un- 
derly  the  diffusion  have  so  far  been  overlooked,  and  their  overall  influence  on  the  solution 
in  different  scenarios  has  remained  unclear.  One  of  our  goals  in  this  thesis  is  to  first  detail 
a  probabilistic  framework  which  helps  us  provide  an  alternative  view  of  the  nonlinear  diffu¬ 
sion  problem.  This  in  turn,  is  instrumental  in  our  providing  an  alternative  interpretation  of 
existing  methods,  e.g.  Perona-Malik  equation,  and  in  using  the  gained  insight  to  propose  a 
solution  to  its  well  known  limitations.  More  specifically,  we  view  an  evolution  equation  by 
way  of  a  controlled  diffusion  [57]  strategies  as  a  solution  resulting  from  an  optimization  of  an 
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energy  functional.  This  ultimately  leads  to  a  two/four  state  Markov  Chain  (MC)  with  one 
step  transition  probabilities  well  adapted  to  preserving  the  salient  features  of  a  signal/image( 
such  as  edges)  while  smoothing  away  the  noise.  As  will  be  elaborated  on  further  below,  in 
addition  to  a  marked  performance  improvement  over  P-M  equation,  and  by  way  of  our  newly 
proposed  technique,  we  are  able  to  lift  a  longstanding  problem  in  nonlinear  diffusion,  namely 
requiring  to  have  prior  knowledge  of  the  stopping  time.  We  in  fact  show  that  the  stable  point 
for  our  equation  is  a  staircase  function. 

The  resulting  image  as  a  staircase  function  is,  however,  at  a  cost  of  a  loss  of  texture 
in  the  image.  This  as  further  elaborated  on  below,  is  inherent  to  the  first  order  Markov 
property  assumed  for  the  image  and  implicit  in  the  edge  modelling(by  a  first  order  difference 
gradient).  In  the  second  part  of  our  work,  we  address  the  texture  loss  problem  in  the  course 
of  smoothing  by  the  before  mentioned  technique. 

As  we  can  see,  piecewise  picture  is  the  best  effort  we  may  obtain  through  evolution  so 
far.  On  the  other  hand,  wavelet  theory  provides  various  methods  to  explore  the  intrinsic 
properties  of  a  signal,  wavelets  of  higher  order  vanishing  moments  result  in  fewer  large  detail 
coefficients  if  a  function  is  smooth,  and  the  decomposition  of  a  signal  into  a  wavelet  frame 
reveals  redundant  information.  We  also  can  see  that  wavelet  packets  provide  a  tool  to  explore 
more  detail  content  from  spectral  domain  of  view.  Inspired  by  these  facts,  a  nature  question 
then  arose  is  whether  we  can  investigate  the  interplay  between  PDE-based  filtering  and 
multiscale  analysis.  This  promising  idea  is  implemented  and  a  texture  preserved  algorithm 
is  proposed  as  shown  in  Chapter  5. 

We  show  that  using  frames  of  wavelet  of  higher  order  vanishing  moments  than  Haar’s  is 
tantamount  to  accounting  for  longer  term  correlation  structure,  while  preserving  the  local 
focus  on  equally  important  features(e.g.  edges).  This  hence  yields  an  efficient  tool  in  ana¬ 
lyzing  and  in  enhancing  images  with  a  careful  account  for  texture  information.  We  propose 
to  decompose  images  into  Daubechies  4  based  wavelet  frames,  where  redundant  information 
will  be  generated.  That  information  is  useful  when  we  deal  with  noisy  image  although  it  is 
not  necessary  when  we  try  to  reconstruct  image.  The  problem  is  that,  we  would  like  to  re¬ 
cover  a  clear,  enhanced  version  of  the  original  one  if  the  picture  given  is  noisy.  We  maintain 
that  potentially  useful  information  lies  in  the  redundant  representation  of  a  signal/image 
and  should  be  fully  exploited. 


CHAPTER  1.  INTRODUCTION 


4 


Our  third  approach  to  separating  and  localizing  various  components  of  a  signal  process 
is  to  ensure  a  statistical  independence  among  them,  thereby  also  affording  one  to  extract 
features  of  importance.  This  will  be  achieved  by  seeking  to  extract  independent  components 
and  whose  higher  order  components  is  better  regarded,  as  it  better  separate  signals  from 
others.  While  many  approaches  spanning  higher  order  statistics  to  learning  algorithms  have 
been  proposed,  we  proposed  two  new  information  measures  which  depend  on  the  probability 
density  functions. 

With  a  improve  of  a  non-parametric  estimation  of  mutual  information,  we  propose  a  non- 
parametric  Renyi  mutual  divergence  approximation  using  dependent  data,  which,  together 
with  the  better  measurement  property  of  Renyi  mutual  divergence  over  mutual  informa¬ 
tion  enable  us  to  practically  apply  it  as  an  alternative  criterion  to  ICA.  We  also  propose 
using  q;— Jensen- Renyi  divergence  that  was  recently  developed  ([37,  1])  as  a  new  indepen¬ 
dent  measure  among  more  than  two  pdf’s  in  lieu  of  mutual  information.  We  show  it  to 
improve  performance  in  separating  sources[5],  as  one  way  impose  weighting  priors  (hence 
contribution)  of  different  data  sets. 

1.2  Summary  of  Thesis  Main  Contributions  and  Organization 

In  Chapter  3  we  propose  a  stochastic  framework  where  nonlinear  diffusions  are  cast  and 
are  given  an  insightful  interpretation  towards  understanding  their  intrinsic  behaviors.  We 
reinterpret  a  linear  evolution  partial  differential  equation(PDE)  as  a  direct  result  of  a  mean 
value  of  random  walk  functional.  In  addition,  Perona- Malik  equation  is  also  interpreted  as 
a  controlled  random  walk  for  which  an  adjusted  energy  functional  yields  a  much  improved 
algorithm.  This  in  fact  results  in  a  new  diffusion  method  based  on  two-sided  gradient  is  pro¬ 
posed  in  section  2.6,  which  yields  piecewise  constant  hltered  images.  Additional  extensions 
were  also  proposed  [49]. 

In  Chapter  5,  We  propose  a  new  evolution-equation  based  technique  that  utilizes  multi¬ 
resolution  wavelet  frame  coefficients.  Wavelet  frame  coefficients  include  redundant  infor¬ 
mation  and  when  the  wavelets  are  of  higher  order  vanishing  moments,  a  longer  correlation 
structure  is  account  for.  We  provide  a  brief  contextualization  and  statement  of  the  problem. 
An  explanation  of  the  decomposition  and  reconstruction  is  given  in  section  4.3,  where  we 
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also  prove  that  the  Heat  diffusion  is  equivalent  to  subtracting  second  level  detail  of  Haar 
frame  coefficients.  In  section  4.4,  we  explain  from  a  spectral  perspective  the  effect  of  a  van¬ 
ishing  moment  of  a  wavelet  and  proceed  to  derive  the  detailed  implementation  equations. 
We  hnally  provide  some  substantiating  denoising  image  examples  as  a  conclusion  in  4.8.  The 
major  results  of  this  chapter  have  been  published  in  [4,  6]. 

In  Chapter  6,  we  provide  a  brief  review  independent  component  analysis  (ICA)  and 
discuss  various  contrast  functions  used  to  recover  independent  source  signals.  It  is  shown  that 
all  the  contrast  functions  are  in  fact  related  to  mutual  information  and  the  MLE  principle. 
This  subsequently  leads  us  to  propose  new  information  criteria,  such  as  JR  divergence  and 
Renyi  mutual  divergence  -Examples  are  also  provided. 

In  Chapter  7,  We  expound  on  these  measures,  show  their  application  to  ICA  and  de¬ 
velop  non-parametric  technique  for  approximating  the  Renyi  mutual  divergence  by  a  cell 
approximation  algorithm. 

In  Chapter  8,  we  provide  some  extensions  and  new  research  areas  for  future  work. 


Chapter  2 

Scale  Space,  Diffusion  and  Variation 


Scale  space,  known  as  a  collection  of  signals  output  from  a  dynamic  filter  whose  transform 
functions  varied  with  time/scale,  provides  a  flexible  choice  to  meet  different  processing  pur¬ 
poses  by  specifying  a  time/scale.  Scale  space  hltering  is  effected  via  a  partial  differential 
equation(PDE),  which,  as  we  will  see  in  the  following,  is  related  to  an  underlying  particle’s 
motion  which  is  in  turn  governed  by  a  stochastic  differential  equation(SDE).  Our  motivation 
of  the  scale  space  methodology  proceedly  exploiting  the  tight  connection  between  a  PDE 
and  a  SDE,  and  the  subsequent  stochastic  interpretation  of  the  PDE  solution.  The  interplay 
between  a  PDE  and  an  energy  functional  yields  a  deep  insight,  which  in  turn  as  we  will 
elaborate  in  later  chapters,  leads  to  a  clarifying  and  solving  some  outstanding  problems. 

2.1  Scale  Space  Concept 

Scale-based  analysis  has  recently  played  an  increasingly  important  role  in  signal  and  image 
analysis  since  Witkin’s  ground  breaking  paper  [95],  in  which  a  so-called  linear  scale  space  was 
constructed  and  the  following  linear  evolution  partial  differential  equation(PDE)  effecting 
the  filtering  was  proposed 

=  AU{t,x) 

=  fix).  (2.1.1) 

The  symbol  A  denotes  a  Laplacian  operator  acting  on  hltered  signals  U{t,x),  which  may  be 
interpreted  as  copies  of  an  original  signal  U{0,x)  at  a  hxed  time/scale  t.  This  was  based  on 
the  conclusion  that  convolving  a  signal  with  a  Gaussian  kernel  was  equivalent  to  evolving  it 


dU (t,  x) 

Wt 

f/(0,a;) 
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with  a  Heat  differential  operator  as  shown  in  the  next  equation,  where  time  now  plays  the 
role  of  scale  [48,  97], 


U (t,  x) 


1 

— ,  exp 

(7(2^)“ 


*  U (0,  x) 


where  x  =  {xi,  ■  ■  ■  ,  Xd)  G  77'^,  |a;|  = 
a  frequency  domain’s  viewpoint,  this  equation  is  equivalent  to 


*  I  xf,  denotes  convolution  of  two  functions.  From 

\  i=l 


U{t,ui)  =  ij{I),ui)exp{—t\u}f‘). 


where  U{t,u)  represents  the  spatial  Fourier  Transform  of  U{t,x).  This  clearly  explains  that 
the  high  frequency  content,  which  represents  sharp  features  as  well  as  noise,  is  removed 
when  t  increases  or  the  scale  becomes  coarser  and  coarser,  i.e.,  both  noise  and  details  will 
be  smoothed  out  and  no  new  information  added. 

The  interest  in  scale  space  analysis  stems  from  the  fact  that  optimally  processing  image 
features  may  be  tracked  across  the  scale.  The  latter  may  be  made  to  vary  nonlinearly  with 
a  proper  modihcation  of  the  Gaussian  kernel. 

Linear  heat  diffusion,  hrst  introduced  as  a  homogenous  scale  space  in  hltering  theory 
by  Witkin([95]),  is  an  isotropic  method  that  smooth  signals  with  Gaussian  kernel  uniformly 
by  increasing  scale,  which  removes  important  features  along  with  noise,  this  lead  to  the  de¬ 
sire  of  scale  space  approaches  that  naturally  preserve  the  intra-scale  correlation  information. 
An  approach  deployed  such  scale  information  was  hrst  proposed  by  Perona  and  Malik  in 
their  landmark  paper ([74])  and  was  aimed  at  preserving  important  sharp  features  such  as 
edges.  Their  technique  may  also  be  viewed  as  a  nonlinear  hlter  whose  selective  smoothing 
is  based  upon  the  computed  local  gradient  (maximal  smoothing  in  low  gradient  or  homoge¬ 
neous  regions,  and  minimal  smoothing  in  high  gradient  regions),  where  signals  are  viewed 
as  nothing  but  piecewise  constant  functions.  The  novelty  of  this  approach  together  with 
its  very  promising  results  triggered  a  tremendous  research  activity  in  computer  vision  and 
applied  mathematics  [73,  78,  93,  87,  60,  94],  where  its  mathematical  properties  as  well  as  its 
numerical  implementations  and  applications  were  investigated.  A  slight  regularization  by  a 
Gaussian  kernel  to  hnest  smooth  a  noisy  signal  was  proposed  in  [15,  58]  prior  to  implementing 
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nonlinear  selective  smoothing  on  the  signal.  On  the  other  hand,  this  was  interpreted  in  [63] 
as  a  robust  estimation  that  resulted  in  an  edge-stopping  function  to  be  applied  to  gradient. 
Many  approaches  have  been  proposed  to  address  a  variety  of  issues  specihc  to  images,  such 
as,  image  enhancement,  segmentation,  and  edge  detection,  which  have  been  hgured  among 
the  most  often  studied  on  account  of  their  great  relevance  to  low-level  vision.  [74,  75]  (see 
[89]  for  a  good  review  of  the  literature). 

2.2  Probabilistic  View  of  Diffusion 

The  fact  that  scale  space  analysis  is  dehned  by  a  PDE-based  diffusion  is  essential  to  its 
probabilistic  interpretation.  Diffusion  is  used  to  describe  a  physical  phenomenon  that  governs 
the  transport  of  heat  flow  moving  from  a  high  to  a  low  spatial  concentration.  This  may  in 
turn  be  investigated  as  a  stochastic  process  of  an  underlying  particle’s  movement.  This  is 
described  by  a  stochastic  differential  equation(SDE)  which  has  a  corresponding  macroscopic 
(by  the  theory  of  large  numbers)  manifestation  by  way  of  a  PDE  as  discussed  next. 

2.2.1  Diffusion  and  SDE 

A  stochastic  process  Xt{x)  may  be  dehned  as  a  parameterized  collection  of  random  variables 
{Xt{x)}^  g  jQj,]  dehned  on  a  probability  space  (D,  JF,  P)  and  assuming  values  in  TT.”  in 
general,  so  that, 

V  a;  G  D,  ci;  — >  Xt{x,ui),  tG  [0,T], 

where  D  is  the  usual  sample  space,  T  the  a-held  and  P  the  probability  measure.  A  nice  and 

intuitively  appealing  interpretation  for  ix  is  that  of  a  moving  particle  whose  starting  position 

is  X  at  time  0  and  whose  position  at  time  t  is  given  by  Xt{x,ui).  We  will  write  Xgt  or  Xst{x) 

to  denote  a  random  process  starting  at  x  at  time  s,  and  currently  at  time  t. 

Definition  1.  Let  b(t,  x)  and  a(t,  x)  he  continuous  in  t,  x  and  assume  that  for  some  constant 
K  eR 

\b{t,x)\‘^  +  \a{t,x)\‘^  <  K{1  +  \x\^)  (2.2.1) 

and  that  for  each  N  E  R,  3  Lj^  with  |a;|  <  N,  \y\  <  N,  for  which 

\b{t,x)  -  b{t,y)\  +  \(j{t,x)  -  (T{t,y)\  <  Ln\x  -  y\.  (2.2.2) 

A  d-dimensional  stochastic  process  Xt  =  (X/,  •  ■  ■  ,  XfY  that  satisfies  the  following  stochas¬ 
tic  differential  equation(SDE)  exists  and  is  called  an  Ito- diffusion [71], 
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dXt  =  hit,  Xt)dt  +  a{t,  Xt)dBt,  (2.2.3) 

where  Bt  is  a  dimension  m  standard  Brownian  motion  vector  and  b(t,  x)  is  a  d  x  1  drift 
coefficient  vector  and  a{t,  x)  is  a  dxm  diffusion  coefficient  matrix(  in  this  paper,  we  normally 
consider  d  =  1,  d  =  2  and  m  =  1  case). 

Diffusion  processes  defined  above  have  continuous  paths,  and  when  a(t,  x)  =  a{x),  h(t,  x)  = 
h{x),  the  diffusion  processes  are  homogenous  processes.  We  denote  by  p{s,  x,  t,  dy)  the  prob¬ 
ability  transition  function  of  a  stochastic  process  Xt,  i.e.,  p{s,  x,  t,  dy)  =  P{Xt  G  dy\Xs  =  x), 
and  by  p{s,  x,  t,  y)  the  probability  transition  density  function  of  a  particle  starting  at  location 
X  at  time  s  and  reaching  y  at  time  t.  As  described  by  the  following  theorem,  the  transi¬ 
tion  probability  is  normally  determined  by  the  drift  and  the  diffusion  coefficients,  which 
characterize  how  the  diffusion  behaves  as  well.  The  inhnitesimal  generator  (i.e.,  continuous 
operator  which  describes  such  a  motion)  of  the  diffusion  in  Eq.  (2.2.3)  can  then  be  written 
as  : 

1  02^^  n 

ij=l  2=1 

where  a(t,x)  =  a{t,x)a{t,x)'^ . 


2.2.2  Kolmogorov’s  Backward  and  Forward  Equations 


While  our  focus  herein  is  on  clarifying  the  situations  which  needed  to  be  consider  for  unrav¬ 
elling  the  connection  between  a  diffusion  process  and  its  corresponding  PDE,  the  details  of 
the  theorems  below  may  be  found  in  [3,  29,  34]. 

Theorem  1.  Let  p{s,x,t,y)  be  the  transition  probability  density  function  of  diffusion  process 
Xt,il  <t<T  with  continuous  coefficients  h{t,x),a{t,x)  that  satisfy  certain  conditions,  then 
p  is  a  so-called  fundamental  solution  of  the  Kolmogorov’s  backward  equation  with  Lg  given 
in  Eq.  (2.2.4), 


dp 

-^  +  Lgp  =  0 

os 

\iinp{s,x,t,y)  =  5{x  —  y) 

s]t 


(2.2.5) 


Theorem  2.  Let  a  transition  density  function  p{s,  x,  t,  y)  of  a  diffusion  process  satisfy  certain 
conditions,  and 

^  d{bi{t,y)p)  d‘^{a{t,y)p) 
dt  ’  dyi  ’  dyidyj 


(2.2.6) 
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exist  and  be  continuous,  then  p{s,x,t,y)  is  a  fundamental  solution  of  Kolmogorov’s  forward 
equation  for  fixed  s  and  x  such  that  s  <t. 


dp 

m 

lim  X,  t,  y) 

tis 


l;p 

S{x-y) 


where  L*  is  the  adjoint  operator  of  L  given  by 


LlP  =  E 

*j=i 


d^{aij{t,y)p) 

dyidyj 


E 

2=1 


d{bi{t,y)p) 

dyi 


(2.2.7) 


(2.2.8) 


According  to  these  two  fundamental  solutions  of  Kolmogorov’s  equations,  the  following 
probabilistic  solutions  of  Kolmogorov’s  equations(PDEs)  are  formulated  in  term  of  the  initial 
conditions. 


Theorem  3.  [25]  Assume  f  and  Lt  satisfy  certain  technical  conditions,  then  U(t,x)  = 
E{f{Xs^t{x))}  =  f  f(y)P(s,x,t,y)dy,  s  <  t,  is  the  solution  of  the  following  PDE  that  has 
an  infinitesimal  operator  Lt  as  in  Eg.  (2.2.4), 


dU(t,x) 

Ft 

U{s,x) 


LtU(t,x) 

fix)- 


(2.2.9) 


Theorem  4.  [34]  Assume  f  and  Lt  satisfy  certain  technical  conditions,  then  U{s,x)  = 
E{f{Xs^t{x))}  =  I  f{y)P{^j^ATy)dy,  where  s  <t,  is  the  solution  of  the  following  PDE  that 
has  an  infinitesimal  operator  Eg  as  in  Eg.  (2.2.4), 

=  0 

=  f{x)  for  s  4  t.  (2.2.10) 


uu  X ) 
ds 


+  LgU  (s,  x) 
U{s,x) 


Theorem  5.  [34]  Assume  f  and  Lt  satisfy  certain  technical  conditions,  then  U(t,y)  = 
f  f(x)P(s,x,t,y)dx  ,  s  <  t,  then  U(t,y)  is  the  solution  of  the  following  PDE  that  has  an 
infinitesimal  operator  L[  as  in  Eq..(  2.2.8), 


dU {t,  y) 
dt 

U{t,y)^f{y) 


LIU  it, y) 
for  t  J,  s. 


(2.2.11) 


With  a  closer  look  at  the  solution  of  Eq.  (2.2.11),  we  can  see  that  the  solution  can  not 
be  expressed  as  a  mean  value  of  a  random  process  since  it  is  integrated  over  all  the  initial 
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position 


Figure  2.1:  Sample  pathes  of  a  random  process  that  begin  at  x  at  time  s  and  arrive  at  y  at 
time  t 

positions  and  is  in  contrast  to  the  case  in  Th.  6.  In  order  to  clarify  the  difference,  we  provide 
two  hgures,  one  of  which  (  £g.  2.1  )  corresponds  to  a  solution  of  Eq.  (2.2.11). 

Next,  we  consider  diffusions  where  the  evolution  time  direction  is  reversed,  a  so-called 
backward  diffusion(see  fig.(  2.2)  and  reference  [55]  [54]  ),  for  which  an  alternative  form  of 
Kolmogorov’s  backward  equation  is  given  as 

Theorem  6.  Assume  that  f  and  Lt  satisfy  certain  technical  conditions.  The  expected  value 
U(t,x)  =  E{f{Xs,t{x))}  is  the  solution  to  the  following  backward  PDF  on  R^ 


dU{t,  x) 


+  L[U{t,x)  =  0 

U{t,x)  =  f{x)  fort  Is 


(2.2.12) 


where  L[  is  an  alternative  adjoint  operator  of  Lt  and  is  defined  as 


Lt  —  ^  ^  dijit, : 


dxidxj 


- 


(2.2.13) 


and  Xs^t,s  <  t  is  a  backward  diffusion  process,  it  also  denoted  as  Xs{x)  with  the  initial 
condition  Xt  =  x.  it  can  also  be  defined  as  a  forward  diffusion  per  Definition  1  by  writing 
Xs{x)  =  Xt-s{x)  =  Xt,D  <t  <T,  Xs{x)  satisfies  the  following  stochastic  integral  equation, 


Xs,t{x)  =  -  b{r,Xr,t{x))dr  ak{r,  Xr,t{x))dB^^ 
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Figure  2.2:  Random  process  sample  pathes  that  begin  with  position  y  at  time  t  and  arrive 
in  X  at  time  s  at  the  inverse  time  direction 


2.2.3  Example:  Brownian  Motion 

An  illustrating  example  of  a  linear  diffusion  is  the  process  described  by  the  PDE  in  Eq. 

(2.1.1),  in  which  L  is  specihed  as  a  Laplacian  operator  A  (i.e.,  7—  +  7-77).  Brownian  Motion, 

ox^  oy^ 

clearly  a  homogeneous  random  processes,  therefore  yields  a  transition  probability  density 
function  denoted  by  pit,  x,  y)  =  p{s,  x,s  +  t,y),  where  x  and  y  may  be  exchanged  as  a  result 
of  the  symmetry  property  of  this  diffusion.  This  property  is  rare  for  most  of  other  processes. 
As  noted  earlier,  diffusion  of  heat  in  a  homogeneous  medium  fundamentally  stems  from  the 
motion  of  particles,  and  it  can  be  shown  that  the  inherent  randomness  of  this  motion  is  well- 
described  by  a  Brownian  motion  Bt  [71] ,  where  an  individual  outcome  cu  G  in  the  prevailing 
sample  space,  may  be  associated  to  a  particle.  The  process  Bt  may  then  be  interpreted  as, 
originating  at  time  0  from  position  x  (assumed  0  for  simplicity),  the  distance  travelled  by 
particle  u  at  time  t.  It  is  well  known  that  a  transition  probability  density  for  a  Brownian 
motion  in  1-D  case,  for  instance,  is  a  Gaussian  PDF  pit,  x,  y)  =  — — 2^  \/  x,  y  E 
TZ,  t  >  0  (recall  that  a  Brownian  motion  has  independent  Gaussian  increments).  It  is 
thus  clear  that  a  stochastic  interpretation  of  a  solution  (if  it  exists)  subjected  to  some 
differentiability  conditions,  can  be  given  by  way  of  an  ensemble  average  [25] 


U{t,x)  =  E^{f{Bt)}  =  /  p{t,x,y)f{y)dy, 

Jn 


(2.2.14) 


where  the  expectation  E{-)  is  computed  over  all  possible  reachable  positions  y  starting  at 
position  X.  In  the  2-D  case,  it  is  similarly  possible  to  have  such  an  interpretation  as  displayed 
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Figure  2.3:  A  particle  (pixel)  may  diffuse  over  many  possible  paths,  and  an  average  is  usually 
computed. 

in  Fig.  2.3.  The  times  t  =  ti  and  t  =  t2  are  the  instants  at  which  all  possible  positions  are 
averaged  to  yield  a  solution  at  the  respective  times. 

2.3  Variational  Methodology 

While  variational  methods  have  been  investigated  in  problems  where  minimizing  cost  is  of 
interest,  i.e.,  energy  functionals  derived  from  a  MAP  principle[32,  31,  35],  it  has  been  recently 
adopted  to  explain  the  mathematical  foundations  of  scale  space  analysis  from  an  optimization 
theoretic  viewpoint  [62] .  Specihcally,  many  existing  evolution  equations  were  shown  to  result 
from  a  minimization  of  energy  functionals.  The  resulting  Euler-Lagrange  equations  lead  to  a 
steepest  gradient  descent  method  giving  rise  to  a  PDE.  Using  this  approach,  we  can  establish 
the  well  known  result  that  the  Brownian  motion  is  a  result  of  minimizing 

E{u)  =  j  \T/u\^dx,  (2.3.1) 

The  resulting  PDE  as  given  in  Eq.  (2.1.1)  generates  different  copies  of  the  image,  at 
different  scales,  and  in  light  of  the  above  probabilistic  interpretation,  effects  a  particle  motion 
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from  a  region  of  high  density  to  one  of  low  density.  Additional  insight  is  achieved  by  way  of 
the  following  theorem  applied  to  a  generic  functional. 

Theorem  7.  For  an  energy  function  that  takes  the  form 

E{u)  =  [  f{u,  '\/u)dx  (2.3.2) 


where  f{u,  Vu)  is  given  as 


f{u,Vu) 


Id  o  o  d 

uu  uu 

2  2-^  +  2-^ 

i,j=l  2=1 


U 


du 

dxi 


(2.3.3) 


with  Oij  =  Oji  and  Oij  satisfy  certain  conditions  as  in  definition  1,  then  the  resulting  PDE 
from  steepest  gradient  descent  method  is  given  as  Eg.  (2.2.9).  This  means  that  the  underlying 
particle  motion  is  homogenous  and  governed  by  a  SDE  described  by  an  infinitesimal  operator 

Lt  of  Eg.  (2.2.4)  aij(t,x)  =  aij,bi(t,x)  =  bi  and  aij  =  aji. 


Proof:  See  Appendix. 

The  infinitesimal  operator  as  in  Eq.  (2.2.4)  now  further  invokes  a  gradient  of  the  function 
of  interest.  This  is  reflected  by  the  energy  functional  and  its  precise  effect  is  only  clear  when 
Qij  >  0,  i  =  j  and  =  0,  i  ^  j,  (note  that  additional  interactions  among  the  component 
are  present  when  Ojj  0,  i  ^  j  ),  and  6*  =  0  (as  in  the  Laplacian  case). 

One  may,  however,  consider  other  generalizations  (general  a{t,x),  b{t,x)  )  for  more  elab¬ 
orate  effects  as  a  function  of  scale  and  space.  A  ease  in  point  is  that  of  avoiding  the  trivial 
smoothing  of  an  image  resulting  from  a  linear  heat  equation,  and  that  of  rather  present¬ 
ing  key  features  through  nonlinear  transformations  to  slow  down/eliminate  some  specific 
filtering. 

A  number  of  very  good  papers  have  provided  inspiring  variational  interpretations  to 
various  nonlinear  smoothing  techniques  [89,  83,  96,  60]  and  thus  proposed  their  specific 
generalized  denoising  methods.  In  [83]  such  an  approach  resulted  in  a  constrained  total 
variation  with  a  gradient  descent  formula.  A  very  clear  explanation  of  nonlinear  diffusion 
resulting  from  variational  methodology  is  given  in  [96] . 
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2.4  Appendix 


Proof  of  Theorem  7: 


df 


Note  that  /•  (|/o,  l/i,  •  • '  ,  l/rf)  =  i  =  0,1,  ■■■,(!.  For  u{x),  x  =  {xi,X2,- ■  ■  ,Xd), 

oyi 

(dvj  dvj  du  \ 

■  ■  ■  ’  ^  j  ’  ^  vector  function  f{u{x))  =  {fi{u),  f2{u),  ■■■  ,  fd{u)), 

denoted  div{f{u))  =  - .  To  keep  the  following  expression  simple,  we  also  denote 

OXi 


dh 


2=1 


hn  =  h,  hi  =  — — ,  for  i  =  1,  2,  •  •  •  ,  d.  According  to  the  Gateauex  differential  dehnitionfSee 
ux  ■ 

[62]),  we  have  for  V  h  G  L‘^{R‘^), 


6E(u)  =  lim 

A^O 


E{u  +  \h)  —  E{u) 
A 


f{u  +  \h,  Vu  +  AVh)  -  f{u,  Vu)  , 
hm - ^ - ax 


A^O 


A 


^f'i{u,Vu)h!idx 


i=0 


/  fi  {u,  T/u)hdx  —  /  div{V  f{u,T/u))hdx 
'90  j_g  J  O 


diviy  f{u,  Vu))hdx 


(2.4.1) 


The  last  equation  =  is  established  given  that  ,  E  f\  {u,  Vu)hdx  =  0. 
let 


'do. 


i=0 


/ (dO)  DIt  '  '  '  I  Vd)  cy  ^  ^  ^ijUiUj  T  'y  ^  b, 


yoVi 


ij=l 


2=1 


It  is  clear  that 


fo(yo,yi,  ■■■  ,yd)  = 

2=1 


(2.4.2) 


(2.4.3) 
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and  for  /c  =  1,  2, . . . ,  d, 


thus 


fkiVO')  Vli  '  '  '  1  Ud)  o  j  ^  ^jkVj)  j  “1“ 


0=1 


^  ^  ^kjVj  A  bkVo 
i=i 


dzv{Vf{u,Vu))  =  J2 


df'{u,Vu) 


2=1 


dxj 


d'^u 


i,j=^ 


2  =  1 


du 
^  dxi 


(2.4.4) 


(2.4.5) 


since  h  is  an  arbitrary  function,  to  obtain  SE{u),  we  see  that  we  only  need  to  have 

diviy  f{u,Vu))  =  0, 

which  according  to  the  steepest  gradient  descend  method,  the  following  PDE  is  required 

du{t,  X 


dt 


=  diviy  f{u,Vu))  =  Ltu{t,x) 


(2.4.6) 


which  proves  Theorem  7. 


Chapter  3 

Nonlinear  controlled  diffusion 


A  nonlinear  controlled  diffusion[56,  28,  26,  71]  is  a  different  strategy  to  achieve  a  spatially 
varying  target  diffusion  and  offer  a  framework  for  a  better  understanding  of  nonlinear  PDEs 
and  corresponding  diffusions.  We  first  give  a  brief  introduction  to  controlled  diffusion,  and 
subsequently  apply  the  latter  to  obtain  a  different  and  insightful  respective  on  nonlinear 
diffusion. 

3.1  Definition  of  Nonlinear  Controlled  Diffusion 

Definition  2.  If  the  drift  and  diffusion  terms  h(t,x),a(t,x)  of  SDE  Eq.  (2.2.3)  are  asso¬ 
ciated  with  a  function  v(t,x),  and  are  denoted  by  h(t,x,v(t,x)),a(t,x,v(t,x)),  Xf  =  Xt  is 
called  a  controlled  diffusion,  where  Xf  satisfies  the  SDE 

dX(  =  bit,  Xt,  vf,  Xt))dt  +  af,  Xt,  vf,  Xt))dBt.  (3.1.1) 

where 

1  (9^  ^  d 

=  2  +  (3.1.2) 

i,j=l  ^  ^  i=l  ^ 

and  af,  x,  vf,  x))  =  af,  x,  vft,  x))ait,  x,  vf,  x))^ . 

Following  Kolmogorov’s  backward  and  forward  equations,  we  will  have  corresponding 
theorems  if  a  number  of  conditions  are  satished.  Here  we  only  list  some  typical  solutions  of 
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the  PDE  with  an  inhnitesimal  operator  as  given  in  Eq.  (3.1.2) 

=  L-U(x) 
dt  *  ^ 

Us{x)  =  f{x)  for  some  0  <  s  <  t, 

Its  solution  may  be  written  as  an  expected  value  Ut{x)  =  E^{f{Xt)} 
[25].  Another  backward  equation  has  the  following  form: 


(3.1.3) 
I  f{y)P{s,x,t,dy) 


dUt{x) 

dt 

Urix) 


m'Utix) 


=  0 

fix) 


for  some  T  >  t, 


(3,1.4) 


where  L^' (■)  is  another  adjoint  operator  of  Lj’  similar  to  that  given  in  Eq.  (2.2.13). 
The  solution  of  this  equation  can  again  be  expressed  as  Utix)  =  E{f{Xstix))),  namely, 
a  probabilistic  mean  value  of  a  reverse  time  process  Xi,0  <  t  <  T  where  T  is  the  hxed 
terminal/end  time  of  the  diffusion  (or  initial  in  the  case  of  an  inverse  diffusion).  Note  that 
a  backward  diffusion  can  be  viewed  as  a  forward  diffusion  by  merely  selecting  t’  =  T  —  t  as 
stated  in  Th.  6. 

we  need  to,  however,  mention  here  that  there  are  many  properties  of  nonlinear  controlled 
diffusions  which  remain  as  open  problems.  Our  approach,  here  is  to  use  the  properties  of 
linear  diffusion  as  an  inspiration  for  investigating  some  practical  nonlinear  problems  whose 
numerical  implementation  is  of  primary  concern. 


3.2  Nonlinear  Diffusion  and  PDE 


As  noted  in  the  previous  chapter,  the  equivalence  between  a  Gaussian  hlter  and  heat 
equation-based  evolution,  led  Witkin  [95]  to  propose  the  following  equation  for  hltering 
a  noisy  observation  of  a  signal/image  fix),  where  we  hereafter  denote  x  G  as  x,  namely 
X  =  {xi,  -  ■  ■  ,  Xd),  to  separate  a  vector  in  TZ^  from  a  scalar  in  TC, 

dLfAx) 


dt 


=  AUtix), 


(3.2.1) 


where  Utix)  denotes  the  data  at  scale  t  G  7^^,  and  Uq{x)  =  f{x)  is  the  initial  data,  where 
X  =  X  eTC  denotes  a  noisy  signal  f\x)  taking  value  in  1-D  space,  while  x  =  ixi,X2)  G  R? 
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implies  spatial  coordinates  of  individual  pixels  of  an  image.  “A”  is  the  Laplacian  operator 
32  32  32 

(i.e.,  7^  when  d  =  1  or  7-^  +  7-^  when  d  =  2).^  The  function  UL,  •)  at  the  hnest  scale 
ox^  0x1  0X2 

(t  =  0)  is  assumed  to  be  comprised  of  a  signal/image  of  interest  and  of  a  white  Gaussian 
noise  of  variance 

Using  the  linear  heat  equation  as  their  paradigm,  Perona  and  Malik([75])  proposed  to 
modify  the  evolution  in  Eq.  (3.2.1)  so  as  to  achieve  maximal  smoothing  in  homogeneous 
regions  of  an  image  to  eliminate  noise,  and  minimal  smoothing  in  high  gradient  regions  to 
preserve  edges.  The  proposed  evolution  equation  which  will  be  central  to  our  development,^ 
is  written  as. 


dUt{x) 

dt 


=  dzv{g{\V{Ut{x))  l)VUi(T)), 


(3.2.2) 


where  “div”  represents  the  divergence  operator,  V  is  the  gradient  operator,  and  g{-)  is 
some  measure  of  ’’edginess”,  thus  a  functional  which  modulates  the  strength  of  the  diffusion 
according  to  the  above  paradigm  (i.e.,  positive  and  monotonously  decreasing  with  5f(0)  =  1). 
One  possible  choice  is  g{v)  =  e  ^  where  K,  a  parameter  to  be  judiciously  chosen,  determines 
the  rate  of  decay  and  thus  the  extent  of  smoothing  of  Ut{x)  for  a  given  gradient  size.  Because 
of  space  limitations,  mathematical  details  as  well  as  numerous  other  improvements  on  the 
P-M  equation  will  not  be  discussed  and  deferred  for  instance  to  [89]. 

To  solve  Eq.  (3.2.1),  we  use  the  stochastic  interpretation,  where  the  underlying  particle 
motion  is  a  Brownian  motion,  and  for  which,  according  to  Th.  3  and  Th.  6  of  Chapter  1,  we 
proceed  to  write 

Ut{x)  =  EoMi^t)),  (3.2.3) 


with  Ut{x)  =  f{x).  We  also  write 


Utix)  =  E(/(W*(f)))  =  E,,^(/(Wi)), 


(3.2.4) 


where  Et^^  specihes  that  the  inverse  diffusion  Xst  is  beginning  from  x  at  time  t.  namely, 
U{t,x)  =  f{x)  is  the  initial  data,  and  for  a  specific  inhnitesimal  operator,  such  as  Lt  is 
a  Laplacian  operator,  we  have  Xt  =  Bt,  this  is  due  to  the  homogenous  and  symmetric 
properties  of  Brownian  motion.  However,  for  most  processes  that  don’t  have  these  properties, 
^The  variable  t  in  this  context  and  throughout,  represents  scale  instead  of  time. 

^Note  that  many  good  techniques  have  since  appeared,  and  to  the  best  of  our  knowledge,  all  are  prone 
to  the  same  over-smoothing  problem  which  is  addressed  herein. 
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Eq.  (3.2.4)  is  an  adaptive  probability  solution.  In  particular,  we  obtain  one  step  transition 
as 


=  j  Et-r,y{f{Xs(^t-T)))p{r,x,y)dy 

=  j  U{t-T,y)Pr{y\x)dy.  (3.2.5) 

Note  that  the  above  diffusion  Xt  is  a  backward  diffusion  which,  as  mentioned  in  Chapter 
1,  is  treated  as  a  forward  diffusion  for  ease  of  exposition  and  in  the  interest  of  clarity.  The 
probability  Pr{y\x)  should  then  be  interpreted  as  P{Xt-r  =  y\Xt  =  x)  for  a  homogenous 
process  and  to  emphasize  the  backward  evolution  in  time/scale. 

3.3  Nonlinear  Diffusion  on  a  Lattice 

In  light  of  the  foregoing  development,  and  for  better  insight  and  intuitive  clarity,  we  find  it 
useful  to  carry  out  most  of  the  exposition  and  the  analysis  in  a  discrete  setting  and  hence 
carry  out  the  computation  on  a  discrete  lattice.  Prior  to  delving  into  our  formulation  and 
interpretation  of  a  Non-Linear  (NL)  diffusion,  we  present  an  illustrative  example  where  the 
so-called  controlled  diffusion  leads  to  a  Markov  Chain  following  an  an  optimization  problem. 

3.3.1  Discrete  Approximation  of  Diffusion 

As  previously  noted,  our  chief  interest  here  is  to  propose  a  framework  within  which  a  stochas¬ 
tic  interpretation  of  a  diffusion  (or  more  generally  of  the  so-called  scale  space  analysis)  is 
achieved,  and  is  in  turn,  instrumental  in  gaining  insight.  Towards  that  end  and  to  further 
extend  and  possibly  improve  on  existing  techniques,  we  begin  by  discretizing  the  space  as 
well  as  the  scale/time  variables. 

Recall  that  a  symmetric  one-dimensional  (1-D)  random  walk  is  well  known  to  converge  to 
a  Brownian  motion  as  r  — 0  and  <5  — >  0,  with  r,  5  respectively  denoting  scale  and  distance 
discrete  step  size.  A  particle  following  such  a  trajectory  will  move  on  a  1-D  lattice  with 
probability  1/2  to  the  left  or  to  the  right,  while  on  a  2-D  plane,  it  will  move  to  any  of  the 
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four  nearest  neighbors  (east,  west,  north,  south)  with  equal  probability  of  1/4.  Formally,  in 
2-D  space,  we  write  the  spatial  variable  {xii,X2i)  =  {xi  +  18^X2  +  iS)  with  i  E  Z  and  the 
scale  tn  =  nr  with  n  G  N,  we  denote  the  one  step  transition  probability  of  a  particle  from 
initial  position  [x^^x^)  to  {xi,X2)  at  the  scale  step,  by  Pni^x^jX^),  {xi,X2))-  As  a  result, 
we  obtain  a  standard  form  from  Eq.  (2.2.14),  namely  the  probability  of  a  particle  being  at 
{xi,X2)  at  scale/time  (n  +  1)**  step  as  r  — 0  and  5  — >  0, 


Proposition  1.  The  following  discrete  equation, 

Pn+l{{x\,  xl),  (xi,  X2))  =  \Pn{{Xi,  xl),  {Xi  -  5,  X2)  +  |Pn((a:?,  X^),  (^i  +  5,  X2)) 

+  \Pn{{x\,  xl),  {Xi,X2  -  5))  +  \Pn{{Xi,  x^),  (xi,  Xa  +  5)) 

(3.3.1) 


converges  to 


X, 


dpt{{Xi,X^2),{xi,X2))  _  d^Pt{{x^^,xl),{xi,X2))  ,  d^Pt{{x^^,X°2),ixi,X2)) 

dt  dx\  dxl  ^  ^  ^  ^ 

Proof:  Subtracting  Pn{,{,x\,  x\) ,  (xi,X2))  from  both  sides  of  Eq.  (3.3.1)  and  dividing  it  by 
we  obtain 


\pn+l{{x\,  0:2),  {xi,  X2))  -  Pn{{Xi,  xl),  {Xi,  X2))]/t  = 

—  [pn{{xl,xl),{Xi  -8,X2))  -2pn{{x^i,xl),{Xi,X2))  + 

Pniix^l^xl),  (Xi  +  (5,  0:2))]  + 

^  [Pn{{x[,X2),  {Xi,X2  -  8))  -  2pn{{xl,xl),  {Xi,X2))  + 

PniiXi,xl),  {Xi,X2  +  5))] 

which  upon  letting  r  =  <5^/4  and  <5  — 0,  concludes  the  proof.  ■ 

With  numerical  implementation  of  a  linear  diffusion  in  hand,  we  proceed  to  consider  a 
1-D  Brownian  motion  on  a  compact  interval.  By  dehning  a  reflecting  wall  on  this  interval,  we 
make  the  resulting  Markov  Chain(MC)  aperiodic  and  recurrent  with  a  solution  to  AU {x)  = 
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0,Ut{x)  =  f{x)  taking  the  form  U(x)  =  Eg(f(XT)).  This  also  implies  a  discrete  solution 
U (T)  =  '^^Pif{xi)  where  Pi  is  the  probability  of  a  particle  to  be  in  state  i  as  t  grows  large. 

i 

The  independence  of  the  solution  U (x)  of  the  initial  state,  implies  its  convergence  to  some 
mean  value  of  of  f(x)  as  x  is  averaged  over  all  possible  paths.  This  in  a  sense  provides  an 
intuitive  justihcation  for  the  convergence  of  a  heat  equation  to  a  constant.  A  case  in  point 
arises  when  we  are  faced  with  a  deterministic  diffusion  Xt,  which  is  alternatively  expressed  as 
dXt  =  (1  0)dt.  The  corresponding  solution  obtained  from  Eq.  (3.1.3)  is  17(T)  = /(a;^ +  t, 
where  xq  =  (x^jX^)  is  the  initial  state,  clearly  non-constant  as  expected. 


3.3.2  Finite  Markov  Chain  Example 


Let  a  hnite  Markov  chain  [27]  as  in  Fig.  3.1  with  three  possible  states  a,  (3, 7  with  respective 
costs  1,  2,  3.  Our  goal  is  to  select  a  “best”  strategy  (minimum  cost)  for  a  particle  to  make 
a  two-step  transition  from  a  state,  say  a.  The  corresponding  transition  probabilities  are 
obtained  from  the  following  sets  of  strategies: 


Gp 


rA  Is  . 

rA  2,  , 

^  3’  3  ’ 


1  1 

2’ 2’ 
3  1 

4’ 4’ 


0)} 

0)} 


For  state  a  for  instance,  two  choices  are  possible: 


•  the  first  strategy  has  it  move  to  state  7  with  probability  and  remain  stationary  with 
probability 

•  the  second  lets  it  move  to  state  (3  with  probability  ^  and  remain  stationary  with 
probability 

Taking  into  account  the  respective  costs  as  well  as  the  transition  probabilities,  an  optimal 
strategy  for  a  is  determined  to  be  (|,  |,0)  with  a  minimum  cost  of  |,  while  in  state  f3  a 
cost  of  I  with  the  strategy  (|,|,0),  and  in  state  7  we  obtain  a  cost  of  1  with  strategy 
(1,  0,  0).  A  second  step  transition  may  be  similarly  found,  with  respective  costs  for  a,  (3, 7,  of 
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Figure  3.1:  Finite  Controlled  Markov  Chain  modelling. 

I  resulting  from  (|,  0,  |),  (|,  0,  |),  (|,  0,  |)  respectively.  This  process  may  be  continued 
indehnitely. 

Note  that  more  complex  strategies  are  possible  and  may  be  constructed  for  the  interme¬ 
diate  steps,  e.g.,  variable  strategies  along  the  steps,  etc..  When,  on  the  other  hand,  a  given 
strategy  set  only  depends  on  the  previous  state,  it  is  referred  to  as  a  Markov  strategy.  It 
is  also  clear  from  the  foregoing  example  that  the  resulting  process,  by  way  of  its  transition 
strategy,  influences  the  overall  mean  value. 

This  example  provides  a  motivation  to  pursue  such  an  approach  of  controlling  diffu- 
sion(drift  and  diffusion  coefficients)  and  sufficient  evidence  for  it  to  be  a  promising  and 
systematic  way  of  addressing  problems  in  image  enhancement/segmentation,  and  shedding 
light  on  the  current  outstanding  problems  in  nonlinear  diffusion. 

3.3.3  Discrete  Time/Scale  Evolution 

By  discretizing  x  and  t,  we  can  account  for  a  reverse  time  evolution  by  relabelling  time 
“f  —  r”  by  1  (or  r)  and  “t  —  nr”  by  n  (  or  nr  ),  hence  making  a  backward  diffusion  equation 
look  more  like  a  forward  diffusion.  We  denote  by  Un{x)  the  value  of  the  solution  at  time 
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step  nr  and  location/state  x.  Eq.  (3.2.5)  can  then  be  written  in  the  form, 

Lfn+i{x)  =  ^Lfn{y)Pn+i{y\x).  (3.3.3) 

y 

where  Pn+i{y\x)  denotes  the  probability  for  a  particle  to  move  to  y  at  step  n  +  1  with  an 
initial  position  x  at  step  n.  Since  the  solution  to  Eq.  (3.3.2)  is  a  Gaussian  transition  density 
function,  it  characterizes  the  evolution  of  a  particle  along  a  Brownian  trajectory  starting  at 
Xq  and  time  t.  Using  the  fact  that  a  limiting  process  of  a  random  walk  is  a  Brownian  motion, 
we  may  compute  the  solution  to  Eq.  (3.3.3)  at  any  desired  discrete  time/scale.  At  the  first 
time  step  r  and  for  a  1-D  case,  we  can  write 

Hi(x)  =  ^f(x-6)  +  ^f(x  +  S),  (3.3.4) 

while  in  a  2-D  scenario,  we  have 

fill 

Hi(xi,X2)  =  -f(xi  -  d,X2)  +  -  f(xi  +  d,X2)  +  -  f(xi,X2  -  S)  +  -  f(xi,X2  +  S), 

(3.3.5) 

both  of  which  are  the  result  of  an  averaging  process.  More  generally,  we  can  respectively 

write  the  1-D  and  2-D  solutions  to  the  linear  heat  equation  as  discrete  expectations 

Hn+lix)  =  ^Hn(x  -  S)  +  ^Un(x  +  S),  (3.3.6) 

111 
l7n+l(Xl,X2)  =  -Hn(Xl  -  S,X2)  +  -Hn(Xl  +  S,X2)  +  -Hn(Xl,X2  -  S)  + 

^Hn(xi,X2  +  S)  (3.3.7) 

Due  to  the  underlying  random  walker  moving  to  its  neighbor  with  probability  1/2  in  1-D 
(and  to  its  four  nearest  neighbors  with  probability  1/4  in  2-D),  it  is  clear  that  the  linear 
evolution  will  indiscriminately  smooth  away  sharp  features  along  with  the  noise. 

3.3.4  A  Stochastic  View  of  Perona-Malik  Equation 

As  noted  earlier,  a  linear  stochastic  differential  equation  leads  to  a  linear  diffusion  by  way 
of  a  Laplacian  as  its  corresponding  inhnitesimal  generator.  Using  this  development  as  an 
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inspiration  together  with  its  discrete  stochastic  formulation  and  interpretation,  we  proceed 
in  an  analogous  manner  to  rewrite  the  P-M  equation  to  be  interpreted  as  a  particle-based 
diffusion. 

Proposition  2.  Based  on  a  particle  system  interpretation,  P-M  equation  may  be  rewritten 
as 


Un+l{x)  =  p'^~^^{x,x  +  5)Un{x 5)  p'^~'~^{x,x  -  5)Un{x  -  5)  -\- 

[1  —  p'^'^^{x,  a;  +  5)  -I-  p'^'^^{x,x  —  6)]Un{x).  (3.3.8) 

Proof:  The  proof  follows  immediately  from  discretizing  Eq.  (5.2.1)  and  rewriting 
l/2g  (I  Un{x  ±  (5)  —  Un{x)  I)  =  p'^^^{x,x  ±  5)  =  p"'~^^{fn+i  =  X  ±  5  \  fn  =  x)  to  denote  the 
transition  probability  of  a  Markov  chain  {'C(-)}  move  from  state  x  to  state  x  ±  5. 

A  similar  expression  for  a  2-D  signal  (image)  may  be  written  as 

Un+l{Xi,X2) 

=  Ps'^^{Xl,X2)Un{Xi  +  5,X2)  +  pf^^  {Xi,  X2)Un{Xi  -  Xs) 

+Pe^^{Xi,  X2)Un{Xi,  X2  +  S)  +  X2)Un{Xi,X2  -  S) 

+  [^-pT^iXuX2)  -pf/^{Xi,X2)  -  p'ff^^{Xi,X2)  -  Pw^iXi,X2)]Un{Xi,X2), 

(3.3.9) 


where 


and 


p'f^\xi,X2)  =P%^\xi,X2)  =pI~^\xi,X2)  =Pw\^i,^2)  =  qf/d  Vf/  I) 


1 


^n+l 


n+1/ 


S7U  1=  JVUf  +  VUf  +  Vf/|  +  Vf/| 


(3.3.10) 


Vf/l  =  Un{Xi  +  S,X2)  -Un{Xi,X2) 
Vf/2  =  Un{Xi-  5,X2)  -Un{Xi,X2) 
Vf/3  =  Un{Xi,X2  +  S)  -  Un{Xi,X2) 
Vf/4  =  Un{Xi,X2-S)-Un{Xi,X2) 
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The  probabilities pg+^(a;i,a;2)  (  resp.  Xs),  X2),  p'^^{xi,X2),  Ps'^^{xi,X2)) 

represent  the  transition  probabilities  of  the  underlying  Markov  chain  he.,  p^'^^{xi,X2)  = 
Ps{^n+i  =  {xi  +  5^X2)  I  in  =  {.Xi,X2))  (similar  expressions  for  other  direction  transition 
probabilities),  which  says  that  activities  of  diffusion  in  4  directions  are  uniform  in  the  same 
location  but  different  and  decided  by  the  local  gradient  measure  in  different  locations,  the 
variational  functional  for  (i(in  special,  d  =  2)  that  corresponded  to  the  P-M  equation  is  given 
as 

=  -  I  (1  —  exp{—  I  Vm  / K})  dx  (3.3.11) 

2  jRd 

The  widely  used  implementation  of  the  PM-algorithm  whose  interpretation  herein  is  given 
below,  is  simpler  and  better  adapted  to  image  processing. 

p"s*\xuX-2)  =  (I  VC/ll), 
pT(xuX^)  =  \g(\VU.,\), 

pT{xuX2)  =  tj(|VC/3|), 

pT(xuX2)  =  tj(|VC/4|) 

These  equations  are  intuitively  appealing,  in  that  the  random  walk  of  a  particle  (or 
pixel)  (or  the  diffusion)  in  takes  place,  in  each  direction,  according  to  the  prevailing  one 
sided  gradient  at  position  {xi,X2)  in  any  of  the  four  directions.  At  time  step  n  +  1,  a 
south  (resp.  north,  east,  west)  moving  walk  takes  place  with  probability  p^'^^{xi,X2)  (resp. 
p^^{xi,X2),  p^^{xi,X2),pi^^{xi,X2)),  and  the  particle  remains  in  place  with  probability 
pT\^)  =  ^-pT^{xi,X2)  -p)(r+^(a;i,a;2)  -p^it^{xi,X2)  -Pw^{xi,X2). 

The  variational  formulation  which  yields  the  above  measures  in  2-D  is 

^  /„4  E  (1  -  “P  {-  (|r)  /A'})  *  (3-3-12) 

Strictly  speaking.  This  expression  is  different  from  that  of  P-M  in  Eq.  (5.2.1)  as  only  the 
gradient  of  each  individual  component  affects  the  transition.  According  to  Eq.  (2.4.1),  we 
have  f{xi,  0:2,  ••  •  ,  Xd)  =  ^  which,  when  used  for  E{u),  yields  the  following 
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Gateauex  difference, 


sE(u)= 

J^x^o  X 

=  —  I  diviy fiyu))hdx  (3.3.13) 

Jn 

using  the  derivative 

f'i  {xi,  ■■■  ,Xd)  =  (3.3.14) 

and  neglecting  the  constant  coefficient  K  on  the  right  hand  side  of  the  formula,  we  have 


dw{Vf(u,Vu))yyy 


i=l 


Y. 

i=l 


dxj  V  dx. 


(3.3.15) 


The  corresponding  steepest  gradient  descent  is 


du 

~di 


(3.3.16) 


It  is  clear  here  that  the  transition  probability  of  such  a  random  walk  is  determined  by 
the  gradient,  inducing  the  desired  control  on  the  diffusion.  This  is  in  sharp  contrast  to  the 
linear  diffusion  where  the  random  walk  invariably  takes  place  with  a  constant  probability  of 
1/4.  Note  that  while  the  derivation  of  an  exact  SDE  corresponding  to  P-M  equation  as  an 
inhnitesimal  generator,  is  interesting  in  it  and  of  itself,  a  more  complex  system  of  particles 
which  is  of  little  relevance  to  our  stated  goal  in  this  paper,  is  required. 


3.4  Two  Sided  Gradient-Driven  Diffusion 

As  discussed  in  Section  2,  at  each  scale  of  our  analysis,  the  mean  value  of  the  process  17(-,  •) 
is  evaluated  as  a  result  of  a  non-homogeneous  random  walk  with  the  transition  probability 
controlled  by  the  underlying  process  at  the  previous  scale.  In  addition,  and  to  avoid  poten¬ 
tial  stability  problems,  we  ensure  that  the  probability  of  a  jump  of  a  particle  (pixel)  farther 
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than  an  immediate  neighbor  is  zero,  which  effectively  emulates  a  continuous  diffusion.  Fur¬ 
thermore,  we  ensure  that  there  always  be  a  one  step  transition  of  a  particle  to  its  neighbors 
to  avoid  a  slowdown  in  convergence  due  to  likely  stationary  states  [56].  We  thus  adopt  this 
paradigm  to  construct  a  non-homogeneous  Markov  chain  whose  transition  probabilities  are 
based  on  the  current  particle  states  and  their  functional  value.  This  results  in  a  set  of  con¬ 
secutive  transition  steps  through  scales,  each  in  a  sense,  dehning  a  new  random  process  with 
a  new  probability  transition. 

While  the  goal  in  signal/image  processing  is  to  maximally  smooth  out  the  noise,  we  are 
also  keen  on  achieving  a  solution  that  is  as  faithful  as  possible  to  the  initial  underlying 
signal.  To  thus  help  better  localize  the  homogeneous  regions  together  with  their  boundaries, 
we  use  in  our  transition  dynamics  a  bidirectional  gradient-based  “probability  measure” .  (sub¬ 
gradient  in  continuous  space).  Using  the  Szokefalvi-Nagy’s  inequality [72],  to  optimize  the 
gradient  energy  (to  delineate  regions),  we  have  to  minimize  the  following  energy  expression. 


S{Un+l)  =  J2S{Un+l{x)) 

X 

=  5^[(^n(a;  +  5)  -  Un{x)){Un+i{x)  -  Unix  -  5))]^ 

X 

+  [{Unix  -5)-  Unix))iUn+lix)  -  Unix  +  5))]^  (3.4.1) 

where  Un+iix),  assumed  to  result  from  Eq.  (3.2.5)  is  written  as, 

Un+iix)  =  P^+\x,  X  -  6)Unix  -  S)  +  P^+\x,  X  +  6)Unix  +  6)  (3.4.2) 

with  P"'~^^ix,x  —  (5)  -|-  P^~^^{x,x  -f  (5)  =  1.  Minimizing  Eq.  (3.4.1)  entails  an  appropriate 
choice  of  a  probability  measure  as  follows. 


Theorem  8.  The  transition  probability  solving  Eq.  (3.4-1)  is  given  by  P^~^^ix,x  —  S)  = 
P{U+i  =  X  -  =  x}  with 


P^^\x,x-5) 


Unix  +  5)- 

-  Unix) 

2 

1 

1 

Ti 

hi 

-  Unix) 

|2  + 

Unix  +  5)-  Unix) 

2 

(3.4.3) 


where  Un+iix)  satisfies  Eq.  (3.4-2). 
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(See  Appendix  A  for  a  proof).  ■ 

For  a  2-D  image,  we  denote  the  transition  probability  by  p^'^^{xi,X2)  =  P{.Cn+i  =  {xi  + 
^,X2)\^n  =  {.Xi,X2)}  (similarly  for  other  probabilities)  and  obtain  the  following  expression 
for  the  transition  probability 

pV'^ixi,  X2)  =  ^ (3.4.4) 

^  1’  s  +  Af  +  S  +  W'  ^  ^ 

where 

X  =  \Un{Xi-5,X2)-Un{Xi,X2)\‘^, 

S  =  \Un{Xi+5,X2)-Un{Xi,X2)\^, 

S  =  \Un{Xi,X2  +  S)  -Un{Xi,X2)  1“^, 

W  =  \  Un{Xl,X2  -  5)  -  Un{Xl,X2)  . 

Using  the  above  transition  probability,  our  newly  proposed  diffusion  is  written  as 

Un+l{Xi,X2)  =  Un{Xi  +  5,X2)Ps'^^{Xi,X2)  +  Un{Xi-  5,X2)p'}/^{Xi,X2) 

Un{Xi,  X2  +  S)Pe^^{Xi,  X2)  +  Un{Xi,  X2  -  S)p'^^{xi,  X2) 

(3.4.5) 


3.5  Discussion 

As  noted  in  the  previous  section,  the  P-M  diffusion  is  driven  by  a  one-sided  gradient  at  any 
position  X,  which  implies  that  weak  smoothing  takes  place  in  the  presence  of  a  relatively 
high  gradient,  even  if  the  latter  is  caused  by  noise.  On  the  other  hand,  when  we  consider 
a  two  sided  gradient  at  a  position  {xi,X2),  we  are  better  able  to  identify  a  noise-induced 
high  gradient  at  that  position,  as  it  is  likely  to  register  a  high  value  on  both  sides  of  the 
pixel  {xi,X2)  under  consideration.  This  would  then  allow  us  to  discriminate  between  a 
“true”  high  gradient  and  that  caused  by  noise.  This  is  in  contrast  to  the  P-M  hlter  which 
relies  on  a  one-sided  gradient  which  calls  for  a  no  transition  state  in  the  Markov  chain,  thus 
preserving  the  prevailing  singularity.  Note  that  this  technical  difficulty  of  the  P-M  equation 
may  further  be  compounded  in  that  the  transition  policy  in  the  Markov  chain  may  eliminate 
true  edges;  since  at  a  position  x  close  to  the  leading  edge  of  a  signal,  and  if  at  position 
{xi  +  S,X2),  U{-,Xi,X2)  is  close  to  U{-,Xi  +  S,X2),  the  probability  of  transition  is  hnite,  and 
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the  smoothing  of  this  leading  edge  takes  place.  This  scenario  is  a  zero  measure  event  when 
a  two  sided  gradient-based  transition  probability  is  used  in  the  policy. 

Using  a  Markov  chain  we  can  thus  model  these  dynamics  via  transition  probabilities 
which,  as  we  mentioned,  may  be  specihed  in  terms  of  the  ratio  of  a  bidirectional  gradient. 
We  should  note  that  in  the  cases  where  the  sub-gradients  are  very  small,  and  hence  little 
signihcance  can  be  attached  to  their  ratio,  the  motion  of  the  particle  is  based  on  a  symmetric 
random  walk,  i.e.,  with  a  1/4  probability  it  moves  to  the  four  nearest  neighbors,  hence  leading 
to  a  linear  heat-like  equation. 

When  on  the  other  hand,  we  use  a  two-sided  gradient,  a  discontinuity  arises  and  the  two 
gradients  are  very  different,  at  least  one  of  them  grows  large  which  blocks  the  diffusion  in 
the  other  direction.  It  can  also  be  seen  from  the  transition  probability  expressions  that  the 
diffusion  is  driven  to  zero  when  a  plateau  is  reached,  i.e.,  a  stable/fixed  point  at  a  staircase 
function  [49]. 

3.6  Experimental  Results 

Our  goal  in  this  section  is  to  substantiate  the  results  that  we  have  established  in  the  previous 
sections.  The  stabilization  of  the  proposed  diffusion  at  staircase  functions  together  with  the 
subsequent  denoising  and  segmentation  effects  are  first  demonstrated  by  running  a  noise  free 
signal/image  which  remains  unaffected  by  the  diffusion  as  displayed  in  Figure  3.3. 

To  establish  a  basis  for  performance  comparison  with  the  P-M  equation  which,  recall, 
was  the  source  of  inspiration  for  our  proposed  technique,  we  run  experiments  where  both 
visual  as  well  as  quantitative  assessments  are  inferred.  The  denoising  performance  can  be 
evaluated  visually  as  shown  in  Figures  (3.4,  3.5,  3.6).  Figure  3.4  demonstrates  a  segmenta- 
tion/denoising  of  an  infra-red  real  image  of  a  boat,  run  to  similar  time/scale  for  both  P-M 
and  our  proposed  technique.  This  class  of  images  is  well  known  for  posing  great  challenges 
to  simple  gradient-based  segmentation  and/or  linear  hltering,  and  this  is  for  the  most  part 
due  to  their  impulsive  nature.  The  results  of  such  approaches  usually  results  in  visually 
unpleasant  and  quantitatively  inaccurate  results  if  at  all.  In  Fig.  3.6,  the  potential  for  a 
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0  20  40  60  80  1 00  0  20  40  60  80  1 00 

newly  proposed  technique  P-M  technique 


0  20  40  60  80  1 00  0  20  40  60  80  1 00 


Time  Time 

Figure  3.2:  Noisy  signal  filtered  by  our  random  walk  algorithm  and  PM  algorithm. 


complete  diffusion  (for  a  non-optimal  choice  of  the  threshold  parameter  in  the  P-M  equa¬ 
tion)  is  demonstrated  whereas  and  as  shown  above,  the  new  approach  will  stabilize  with  no 
parameter  adjustment.  For  a  well  known  stopping  time  and  well  chosen  parameter,  P-M 
approach  performs  quite  well.  This  may,  however,  turn  out  to  be  a  limitation  for  a  number 
of  real  applications,  as  it  is  generally  not  known  what  the  image  consists  of. 

In  Fig.  3.7  an  enhancement /deblurring-like  effect  using  our  algorithm  is  also  demon¬ 
strated  for  a  checker  board.  The  cost  of  doing  away  with  the  explicit  knowledge  of  the 
stopping  time  for  our  approach,  is  the  arising  block  effects  which,  although  common  to 
many  existing  techniques,  remain  a  drawback.  Although  this  may  be  hne  tuned  away  for 
pure  denoising  purposes,  we  discuss  in  the  next  chapter  techniques  which  specihcally  ad¬ 
dress  such  a  shortcoming.  In  Fig.  3.7,  a  de-blurring  example  is  shown,  demonstrating  the 
capacity  of  the  algorithm  to  enhance  edges  and  again  to  stabilize  at  staircase  functions.  In 
Figure(  3.8),(  3.9)  we  demonstrate  denoising  and  segmentation  results  of  images,  which  can 
serve  as  a  comparison  of  our  new  algorithm  and  the  P-M  algorithm. 

For  establishing  a  more  quantitative  measure  of  performance  we  use  the  hgures  in  Fig¬ 
ure  3.10.  A  pixel  deviation  is  computed  and  an  error  rate  is  dehned  as  an  unmatched 
segment(  in  the  meaning  of  region  segment)  between  hltered  image(  or  noisy  image)  and 
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clear  image,  the  error  rate  curve  which  is  consistent  with  our  visual  assessment  is  displayed 
in  Fig.  3.11. 


3.7  Conclusion 

The  proposed  stochastic  interpretation  together  with  its  link  to  controlled  diffusion  are  shown 
to  not  only  explicate  existing  techniques  and  their  limitations,  but  to  also  provide  sufficient 
insight  to  develop  other  novel  physically  and  geometrically  driven  methodologies.  We  have 
also  succeeded  in  resolving  in  part,  a  well  known  and  long  standing  problem  of  unknown 
stopping  criterion. 
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3.8  Appendix  A 

Proof-.  We  first  proceed  to  re-express  S(Un+i)  in  terms  of  Eq.  (3.4.2)  and  Pn+i{x,x  -f  5)  = 
1  —  Pn+i{x,  X  —  S).  By  subsequently  differentiating  with  respect  to  Pn{x,  x  —  S)  and  bearing 
in  mind  that  a  two  sided-gradient  is  used,  we  have  the  following  equation 

d{S{Lfn+i))/dPn+i{x,x  -  (5) 

=  2[{U4x  +  6)-  Un{x)f{Un+l{x)  -  Unix  -  5))]idUn+lix))/dPn+lix,X-  5) 

+  2[iUnix  -5)-  Unix)fiUn+lix)  -  Unix  +  5)  )]  (x)  ) /9P„+1  (x,  X  -  5) 

(3.8.1) 

Setting  diSiUn+i))/dPn+iix,x  —  5)  =  0  ,  and  assuming  a  non-degenerate  case  of 
dUn+iix)/dPn+iix,x  —  S)  ^  0,  the  optimal  transition  probability  at  the  n-th  step  implies 

[{Unix  +  6)-  Unix)f  +  (Unix  -6)-  Unix)f]Un+lix) 

=  (Unix  +  6)-  Unix)fUnix  -  5)  +  (P^X  -  5)  -  Unix)fUnix  +  5)  (3.8.2) 

where  the  replacement  of  Un+ii^)  with  Eq.  (3.4.2)  will  reduce  to  Eq.  (3.4.3).  Note  that 
if  dUn+iix) /dPn+iix.iX  —  5)  =  0,  we  can  see  that  a  left  sided-gradient  is  equal  to  the  right 
sided-gradient  resulting  in  an  optimal  choice  of  probability  of  1/2.  ■ 
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Figure  3.4:  A  noisy  image  together  with  its  enhanced  copy  by  the  proposed  algorithm  and 
by  the  P-M  method  best  result. 
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PM  100k 


Figure  3.5:  Complete  Smoothing  vs  Stability. 
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Figure  3.6:  PM  algorithm. 


Sinusoid  Image  Filtered  Sinusoid  Image 


Figure  3.7:  Checker  Board  Enhancement. 
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Noisy  Image 


Clear  house  image 


Segment  clear  image 


Filtered  Image 


Figure  3.8:  tools  segmentation. 


Noisy  house  image 


Segment  noisy  image 


Segmented  Image 


Filtered  house  image 


Segment  filtered  image 


Figure  3.9:  House  segmentation 
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Noisy  circle  image  segment  of  noisy  circle  image 


Figure  3.10:  Circle  segmentation. 
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Figure  3.11:  Error  rate  in  circle  segmentation  for  different  SNR  scenarios. 


Chapter  4 


Multiscale  Wavelet  space 


Wavelet  as  a  local  signal  analysis  tool  plays  a  very  important  role  in  obtaining  detail  informa¬ 
tion  about  signal  structures.  In  this  chapter,  we  introduce  some  basic  facts  about  wavelets 
that  would  be  useful  for  next  chapter.  Further  details  may  be  found  in  [66]. 


4.1  Definition  of  a  Wavelet 


Let  be  a  space  of  functions  /(x),  x  eTZ,  such  that 


2  — 


\f{x)\‘^dx  <  -l-cx) 


(4.1,1) 


where  the  integral  II/II2  defines  a  norm  of  L^{71),  and  is  also  denoted  by  ||/||  in  this  thesis. 

A  wavelet,  also  known  as  a  mother  wavelet,  is  a  function  ijj{x)  G  L‘^{TV)  that  is  centered 
in  the  neighborhood  of  x  with  ||'^||  =  1  and  zero  mean: 


r*+oo 


'ip{x)dx  =  0 


(4.1.2) 


In  the  spectral  domain,  we  denote  by  '^(cci)  the  Fourier  transform  of  ipix),  and  by  |t/’(c(2)| 
the  amplitude  of  ipiuj).  The  mother  wavelet  may  be  interpreted  as  the  impulse  response  of 
a  band-pass  hlter  since  '^(0)  =  ijj{x)dx  =  0.  The  mother  wavelet  is  dehned  jointly  with 
a  scaling  function  (also  called  father  wavelet),  who’s  mean  is  non- vanishing  and  which  has  a 
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corresponding  low-pass  filter  impulse  response,  namely  0  G  L‘^{TZ)  with 

/+00 

(j){x)dx  =  1. 


(4.1.3) 


its  Fourier  transform  also  satisfies, 


r+cxi  T  /‘+00 

MP=  /  / 

Jl  ^  J  LJ 


\m? 


(4.1.4) 


By  dilating  and  translating  a  wavelet  function,  we  obtain  a  family  of  wavelet  atoms 


( 

V 


X  —  u 


with  ||t/’n,s||  =  1-  The  wavelet  transform  of  a  function  /  G  L‘^{7V)  at  u  and  s  is  dehned  as 


r-\-oo  1 

Wf{u,s)  =<  fAn,s  >=  /  f{x)^ij 


X  —  u 


with 


1  f  — ti 

V’s(m)  =  I  — 

S  \  5 


dx  =  f*'ips{u)  (4.1.5) 


(4.1.6) 


Wf{u,s)  measures  the  variation  of  /  in  a  neighborhood  of  u,  whose  size  is  proportional  to 

s. 

Similarly,  a  family  of  scaling  atoms  may  be  generated  as 


X)  = 


X  —  u 


whose  transform  at  u  and  s 


■>+CO  1 

Lf{u,s)  =<  f,(j)u,s  >=  I 


X  —  u 


dx 


(4.1.7) 


provides  the  low-frequency  approximation  of  /  at  the  scale  s  in  a  neighborhood  of  u. 

By  the  low-pass  and  band-pass  hlters’  properties,  the  wavelet  and  scaling  atoms  play  an 
important  role  in  generating  bases  of  the  approximation  and  detail  spaces  as  shown  in  the 
following  sections. 
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4.2  Wavelet  frames 

The  frame  theory  was  originally  developed  by  Duffin  and  Schaeffer  [24]  to  reconstruct  a  band- 
limited  signal  /  from  irregularly  spaced  samples  {f{tn)}n&z-  A  frame  is  a  family  of  vector 
{0n}ngr  in  a  space  that  characterizes  any  signal  /  from  its  inner  products  {<  /,  0n  >}ner, 
where  T  is  an  index  set.  The  formal  dehnition  of  a  frame  is 

Definition  3.  (Definition  of  Frame):  The  sequence  {0n}ner  is  a  frame  of  a  Hilbert  space  H 
if  there  exist  two  constants  A  and  B  such  that  for  any  f  E  H 

A||/|p<5^|</,0n>p<i?||/ir  (4.2.1) 

ner 

when  A  =  B,  the  frame  is  said  to  be  tight.  Obviously  one  can  see  that  an  orthonormal  basis 
is  a  tight  frame  with  A  =  B  =  1.  These  bounds  may  also  display  the  redundancy  of  a  frame 
representation  of  a  signal.  An  operator  U  dehned  by  U  :  U f[n]  =<  f,(j)n  >,  V  n  G  T  is 
called  a  frame  operator. 

With  {0n}ner  denoting  a  frame,  the  sampling  observed  data  represent  coefficients,  whose 
reconstruction  is  carried  out  using  a  dual  frame  {0n}ner  dehned  by 

fin  =  {U*U)-^fin,  (4.2.2) 

to  yield  /  as 

f  =  Y,<f\^n>fin.  (4.2.3) 

ner 

If  the  frame  {^njner  is  tight,  then  the  dual  frame  {fin}  is  equal  to  {fin}  with  a  scaling 
difference,  which  generally  simplihes  the  numerical  implementation  of  a  frame  decomposition 
and  explains  its  wider  acceptance. 

Wavelet  frames  are  constructed  by  starting  with  a  continuous  wavelet  transform  whose 
translation  and  scale  parameters  are  appropriately  sampled  to  cover  the  time-frequency 
plane  with  corresponding  discrete  wavelet  family.  To  obtain  a  full  cover,  we  sample  the  scale 
parameter  s  along  an  exponential  sequence  {  a^}j&z,  with  a  sufficiently  small  dilation  step 
a  >  1.  The  time  translation  u  is  sampled  uniformly  at  intervals  proportional  to  the  scale  , 
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yielding 

Necessary  and  sufficient  conditions  for  {'4’j,n}{j,n)ez^  fo  be  a  frame  and  have  a  dual  frame 
is  stated  (  see  [21]).  However,  the  sampling  interval  a%o  might  cause  a  translation  distortion 
if  its  value  is  large  compared  to  the  rate  of  variations  of  f  To  construct  a  translation 

invariant  wavelet  representation,  the  scale  s  is  discretized  while  the  translation  parameter  u 
is  not.  The  scale  is  sampled  along  a  dyadic  sequence  The  dyadic  wavelet  transform 

of  /  G  L'^iTZ)  is  dehned  by 


^+oo 


Wf{u,2^)  =<  >=  / 

J —oo  y  2^ 


X 


u 


2i 


dx  =  f  *  {u) 


with 


-u 


2i 


(4.2.5) 


(4.2.6) 


It  can  be  shown  that  the  normalized  dyadic  wavelet  transform  operator  U f{j,  u)  =  W f{u,  2^) 
satishes  frame  inequalities  and  a  reconstructing  wavelet  'ijj  may  be  constructed.  In  practice, 
we  need  to  compute  a  discrete  dyadic  wavelet  transform,  which  may  be  carried  out  by  a  fast 
filter  bank  algorithm  for  an  appropriately  designed  wavelet  and  described  as  follows. 


4.3  Wavelet  basis 

A  wavelet  dilated  by  2^  and  translated  by  2%  for  all  (j,  n)  G  generates  an  orthonormal 
basis  of  L‘^{IZ)  but  distribution  information  at  different  resolutions.  This  is  intimately  related 
to  multi-resolution  signal  approximation  dehned  below. 
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Definition  4.  A  sequence  (yj)j^z  of  closed  subspaces  of  L‘^{1Z)  is  a  multi-resolution  approx¬ 
imation  if  the  following  6  properties  are  satisfied: 

V  (j,  k)  e  fit)  e  V,  ^  fit  -  2%)  e  Vj, 
y  JeZ,VJ+^cV„ 

WjeZJit)EV,^fit/2)eV,+,, 
lim^^+oo  Vj  =  n^=_^Vj  =  {0}, 
limj^_oo  Vj  =  Closure  [uj^^Vj)  =  L^(7^). 

There  exists  9  such  that  6*(t  —  ®  Riesz  basis  ^  o/ Vq. 

We  need  to  mention  that  Vj  characterizes  the  signal  approximation  at  the  resolution  2~\ 
The  approximation  of  /  on  Vj  is  defined  as  the  orthogonal  projection  Pvjf  of  function  /  onto 
the  space  Vj. 

In  order  for  a  family  {6*(t  —  n)}n^z  to  be  a  Riesz  basis  of  the  space  Vq,  the  necessary  and 
sufficient  condition  is  there  exist  R  >  0,  R  >  0,  such  that 


1 

—  < 

B 


+  0O 


-2/c7r)p  < 


fc=  — OO 


(4.3.1) 


This  yields  the  construction  of  a  scaling  function  0  through  Fourier  transform  of  6 it) 

eiu) 


(fiu)  = 


(E:r-ool^(^-2A;7r)P)V2' 


(4.3.2) 


for  which  we  denote 

<Pxnit)  =  ^0  j  •  (4.3.3) 

Scaling  atoms  {(j)j,n}n&z  defined  above  normally  span  a  multi-resolution  approximation 
space  Vj.  As  we  know  that  Vj  C  10- 1,  let  Wj  be  the  orthogonal  complement  of  Vj  in  10- 1: 


10-1  =  10-  ©  R0- 

the  orthogonal  projection  of  /  on  I0-i  can  be  further  decomposed  as 


Pv,.J  =  PvJ®PwJ- 


^Definition  of  Riesz  basis  can  be  found  in  [66]. 
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The  complement  Pwjf  provides  further  ’’details”  of  /  at  scale  2^  and  one  can  construct 
an  orthonormal  basis  of  Wj  by  wavelet  atoms  {t/’yn}ne^)  which  are  obtained  by  sampling  u 
and  s  on  a  2^  grid, 

i’jAx)  =  ^  •  (4.3.4) 

With  {0j>}ne^  and  {'ipjAn&z  as  the  orthonormal  bases  of  Vj  and  Wj,  the  coefficients  of 
/  in  these  spaces  are  given  by 

%N  =< >  and  dj[n]  =<  f,'il)j^n>  (4.3.5) 

The  orthogonal  projection  of  /  over  Vj  and  Wj  are  therefore  obtained  in  the  following 
expressions, 

OO  OO 

Pvjf  —  ^  ^  ^  ^  4^j,n  —  ^  ^  %[^]0j,n5 

n=—oo  n=—oo 

OO  OO 

^j,n  =  5Z  (4.3.6) 

n=—oo  n=—oo 

A  multi- resolution  approximation  {Vj}j^z  is  entirely  characterized  by  the  scaling  function 
0  that  generates  an  orthonormal  basis  for  each  space  Vj,  This  may  in  turn  be  shown  to  be 
implemented  by  a  conjugate  mirror  hlter. 

4.3.1  Wavelet  design:  Connection  to  conjngate  mirror  filters 

Let  h  and  be  a  pair  of  hnite  impulse  response  Liters,  h  is  a  low-pass  hlter  whose  transfer 
function  satishes  h{0)  =  \f2,  where  h{u))  =  Fourier  series  of  the 

discrete  hlter  h[n]. 

Mallat[64]  and  Meyer [69]  point  out  that,  for  an  integrable  scaling  function,  the  Fourier 
series  of  h[n]  satishes 

\h{ujA  vAuj  +  'kA  =  2,  Vo;e7^  (4.3.7) 

and 

h(0)  =  A2.  (4.3.8) 

A  discrete  hlter  whose  Fourier  series  satishes  Eq.(4.3.7)  is  called  a  conjugate  mirror  hlter. 
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Conversely,  they  also  give  necessary  conditions  on  which  a  scaling  function  can  be  con¬ 
structed,  namely,  if  h{uj)  is  27r  periodic  and  continuously  differentiable  in  a  neighborhood  of 
a;  =  0,  and  it  satishes  Eq.(4.3.7)  and  Eq.(4.3.8),  and  if 


inf  \h(u})\  >  0, 

aje[-7r/2,7r/2] 


(4.3.9) 


It  \  TT  ^  lt^\lt^\ 

P=1  ^  ^ 


(4.3.10) 


is  the  Fourier  transform  of  a  scaling  function  0  G  L‘^{TZ).  Further,  we  can  decompose 


^0(^/2)=  Hn\(l)it-n) 


(4.3.11) 


h[n\  =<  ^0(^),0(^-«)  >  • 


(4.3.12) 


Suppose  that  the  Fourier  transform  of  0  is  hnite,  the  corresponding  wavelet  0  is  dehned 
through  a  Fourier  Transform  as 


W)  = 


(4.3.13) 


g{uj)  =  e  +  tt). 


(4.3.14) 


The  necessary  and  sufficient  conditions  on  g  such  that,  for  any  scale  20  {'ipj,n{,x)}n&z  be 
an  orthonormal  basis  of  Wj  and  {0j>}(yn)e.22  be  an  orthonormal  basis  of  L‘^{TZ)  is 


(o;)P  +  |,9(o;  +  7r)P  =  2 


(4.3.15) 


g{uj)h*{uj)  -|-  g{u  -|-  7r)h*{u  -|-  tt)  =  0. 


(4.3.16) 


It  is  shown  that  g{uj)  is  the  Fourier  series  of 


9[n\  =<  >, 


(4.3.17) 
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which  are  the  decomposition  coefficients  of 

1  4  +“ 

(4.3.18) 

^  n=—oo 

for  which,  calculating  the  inverse  Fourier  transform  of  Eq. (4.3.14)  yields 

g[n]  =  {—lY~"'h[l  —  n]  (4.3.19) 

Therefore  5^  is  a  high-pass  conjugate  mirror  filter.  The  conjugate  mirror  filters  h[n],  g[n] 
play  an  important  role  in  the  fast  wavelet  transform  algorithm.  A  fast  filter  bank  algorithm 
is  further  introduced  to  calculate  the  coefficients  of  a  signal  measured  at  a  finite  resolution, 
thus,  the  following  formula  is  further  proceeded  to  Eq.  (4.3.5) 

+CXD 

aj+i\p\  =  h[n  —  2p]aj[n\ 

n=— 00 

+00 

dj+i[p]=  9[n  -  2p]aj[n]. 

n=— 00 

We  can  see  that  signals  are  decomposed  into  low-pass  and  high-pass  components  sub¬ 
sampled  by  2,  and  coefficient  aj\p]  may  be  reconstructed  as 


+  00 


+CXD 


i[p]  =  Hp  - ‘^n]aj+i[n]  +  ^  g[p  -  2n]dj+i[ 

n=—oo  n=—oo 

=  dj+i  ic  h\p]  +  dj+i  -k  g[p] 


n\ 


(4.3.20) 


where  ”  -k  ”  represents  convolution  and 


X  n  = 


x[k]  ii  n  =  2k 
0  n  =  2k  +  1 


4.3.2  Vanishing  Moments  vs.  Snpport  size  of  a  wavelet 

In  order  to  parsimoniously  represent  a  function  /  in  a  wavelet  basis,  fewer  non-zero  coeffi¬ 
cients  are  desirable.  This  is  depend  on  the  regularity  of  the  function  /,  and  on  choosing  an 
appropriate  wavelet,  namely,  wavelet  with  specific  vanishing  moments  and  support  size. 
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A  wavelet  with  n  vanishing  moments  satishes  the  following  formula, 


^+oo 


x^il){x)dx  =  0,  0  <  A;  <  n, 


(4.3.21) 


which  means  that  a  wavelet  if)  is  orthogonal  to  any  polynomial  of  degree  up  to  n  —  1.  This 
also  leads  to  the  following  relation  to  its  Fourier  transform  %[), 


r*+oo 


x^'ilj{x)dx  =  =  0,  0  <  A;  <  n 


(4.3.22) 


where  i  =  yj—l.  This  property  reveals  that,  if  a  wavelet  ip  has  n  vanishing  moments,  tp  and 
its  hrst  n  —  1  derivatives  are  0  at  ci;  =  0.  Therefore,  if  /  G  C'"',  C^  is  a  collection  of  functions 
which  are  n  times  continuously  differentiable  (  i.e.  /  can  be  well  approximated  by  a  Taylor 
polynomial),  then  such  an  analyzing  wavelet  produces  small  amplitude  coefficients  at  fine 
scales  and  one  can  shown  that  the  decay  rate  of  \W f{u,s)\  across  scales  s  is  closely  related 
to  the  exponential  degree  of  the  Lipschitz  regularity  of  /.  This  property  is  used  to  detect 
singularities  by  finding  abscissa  where  modulus  maxima  converge  at  fine  scales,  as  discussed 
in  detail  in  [69,  45,  67]. 

One  may  also  argue  that  the  coefficients  <  /,  t/’yn  >  may  have  large  amplitudes  if  a 
wavelet  tp  has  a  large  support.  To  therefore  minimize  the  number  of  high  amplitude  coeffi¬ 
cients,  we  need  to  reduce  the  support  size  of  tp  as  much  as  possible.  The  constraints  imposed 
on  orthogonal  wavelets  imply  that  if  tp  has  p  vanishing  moments  then  its  support  is  at  least 
of  size  2p  —  1  when  choosing  a  particular  wavelet  [33,  88].  We  hence  face  a  tradeoff  between 
the  number  of  vanishing  moments  and  the  support  size. 


4.4  Daubechies  Wavelets 

Daubechies  wavelets  have  minimum  size  supports  for  any  given  vanishing  moments  of  order 
p.  It  is  shown  that  the  support  sizes  of  a  scaling  function  cp  and  a  wavelet  tp  are  related  to 
the  conjugate  mirror  filter  h  that  are  used  to  construct  them.  The  scaling  function  tp  has  a 
compact  support  if  and  only  if  h  has  a  compact  support,  and  furthermore  their  support  are 
equal.  If  the  support  of  h  and  (p  is  [Ni,  N2\  then  the  support  of  tp  is  [(iVi  —  A"2  -|-  l)/2,  {N2  — 
Ni  -\-  l)/2](see  [64]).  In  addition,  to  ensure  that  a  wavelet  has  p  vanishing  moments,  the 
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Fourier  transform  h  of  the  conjugate  mirror  filter  h,  which  is  used  to  construct  ip,  must  have 
a  zero  of  order  p  at  cu  =  tt,  namely,  h{u)  can  be  written  as 

-  /1  I  P 

=  (4.4.1) 

where  is  a  polynomial  of  minimum  degree  m  such  that  h  satishes 

|h(o;)p  +  |h(o;  +  7r)p  =  2.  (4.4.2) 

As  a  result,  h  has  N  =  m  +  p  +  1  non- zero  coefficients.  Thus  obtaining  a  minimum  support 
wavelet  is  equivalent  to  obtaining  a  minimum  support  h,  and  the  minimum  degree  m  of  R 
required  is  m  =  p  —  1,  see  Daubechies[20]. 

Daubechies  wavelets  are  constructed  by  choosing  a  minimum  degree  polynomial 

m  m 

Rie-n  =  Y.  =  ^0  n(l  -  (4-4-3) 

k=0  k=0 

such  that  =  P(sin^(ci;/2)),  where  P{x)  is  a  polynomial  satisfying 

(1  -  xfP{x)  +  xPP{l  -x)  =  l  (4.4.4) 

It  turns  out  that  Daubechies  wavelets  constructed  above  have  a  minimum  size  sup¬ 
port  equal  to  [— p  +  l,p].  Daubechies  wavelets  are,  however,  very  asymmetric,  as  shown 
in  Fig.(4.1). 

Daubechies  wavelets  demonstrate  that  there  is  a  tradeoff  between  vanishing  moments  and 
the  support  size,  specihcally  the  higher  the  vanishing  moments,  the  larger  the  support  size. 
The  shortest  support  Daubechies  wavelet  is  a  Haar  wavelet,  which  has  a  vanishing  moment 
of  order  1. 

4.5  Wavelet  Packet 

Instead  of  decomposing  only  the  approximation  spaces  Vj  to  construct  lower  resolution  ap¬ 
proximation  space  Vj+i  and  detail  space  hFj+i  by  wavelet  bases,  we  can  also  decompose  the 
detail  space  Wj  into  two  subspaces,  an  approximation  and  a  detail  space.  This  calls  for 
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Figure  4.1:  Daubechies  scaling  function  0  (top)  and  wavelet  '^(bottom)  with  vanishing  mo¬ 
ments  2 
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new  bases,  the  so-called  wavelet  packets,  due  to  Coifman,  Meyer  and  Wickerhauser[79].  The 
analysis  may  be  interpreted  as  spectral  partitioning. 

If  the  signals  are  approximated  at  scale  2^,  we  can  represent  the  recursive  splitting  of 
vector  spaces  in  a  binary  tree.  The  root  of  the  tree  is  associated  to  the  approximation  space 
Vl-  Denote  each  node  of  the  tree  is  labelled  by  (j,  p),  and  is  associated  to  a 

detail  space  hTj,  where  j  —  L  >  0  is  the  depth  of  the  node  in  the  tree  ,  and  p  is  the  number 
of  nodes  that  are  on  its  left  at  the  same  depth  j  —  L.  Therefore,  the  two  children  nodes  of 
(j,  p)  are  orthogonal  subspaces  (Wavelet  packet  spaces)  such  that 

Wf  = 

where  and  may  be  viewed  as  the  approximation  and  detail  spaces  of  hTj  and 

the  two  wavelet  packet  orthogonal  bases  of  the  two  children  notes  are  defined  as 

+  00 

^1+1  (t)  =  ^  h[n]'4)^^{t-Tn)  (4.5.1) 

n=—oo 

and 

+  00 

Y,  9[n]rj{t-2^n)  (4.5.2) 

n=—oo 

with 

h[n]  =<  i/jjliiu),  -  2%)  >,  g[n]  =<  -  2%)  >  (4.5.3) 

This  recursive  splitting  defines  a  binary  wavelet  packet  tree  where  each  parent  node  is 
divided  into  two  orthogonal  subspaces  each  with  concentrated  energy  in  different  frequency 
bins.  The  idea  of  wavelet  packets  generalizes  the  link  between  multiresolution  approximations 
and  wavelets,  and  the  designed  bases  for  wavelet  packets  are  well  adapted  to  decomposing 
signals  that  have  different  behavior  in  different  frequency  intervals,  several  specihc  wavelet 
packet  bases,  such  as  cosine  bases,  block  bases,  lapped  orthogonal  bases,  etc.  are  intensively 
discussed,  see[64,  92,  80,  79,  68,  65]. 


Chapter  5 

Wavelet  Frame-Based  Nonlinear 
Filtering 

5.1  Introduction 

In  spite  of  the  fact  that  the  performance  improvement  as  in  Chapter  3  was  remarkable,  the 
drawback  was  the  loss  of  features  such  as  texture  which,  in  some  class(other  than  those 
investigated  in  [89])  of  images  is  very  important  to  preserve. 

the  nature  of  question  which  then  follows  and  which  is  addressed  in  this  chapter  is  whether 
an  efficient  and  effective  hltering  approach  can  be  made  feature  (e.g.  texture)preserves,  here 
we  address  this  problem  and  show  that  using  wavelet  frames  with  wavelets  of  higher  order 
moments  than  Haar’s  is  tantamount  to  accounting  for  longer  term  correlation  structure  while 
preserving  the  local  focus.  This  hence  yields  an  efficient  tool  in  analyzing  and  enhancing 
images  with  a  careful  account  for  texture  information. 

we  explain  a  connection  between  the  equation  and  the  process  to  Haar  wavelet  coefficient 
to  establish  a  direct  equivalence  between  a  linear  Heat  equation  and  a  Haar  coefficients  based 
evolution. 

In  light  of  this  derivation  shown  in  Eq.  5.4.1,  we  proceed  to  generalize  this  connection 
and  in  fact  derive  wavelet  frame  coefficients,  this  leads  to  a  remarkable  signal  and  image 
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enhancement  while  preserved  features  which  are  important  to  other  approaches,  such  as, 
image  classification  etc. 

5.2  Problem  Statement 

As  noted  above  the  Perona-Malik  equation  still  enjoys  a  great  deal  of  popularity  for  achieving 
a  selective  nonlinear  filtering,  compatible  with  the  desired  objective  of  image  filtering,  namely 
that  homogeneous  areas  be  maximally  smoothed  while  edge  contours  be  maximally  preserved 
(or  equivalently  minimally  smoothed).  It  is  expressed  as 

=  div  (r(\  V  (U(t,x))  I)  V(7((,i)) ,  (5.2,1) 

where  IF{v)  may  be  chosen  as  jF(v)  =  e  ,  K  determines  the  rate  of  decay  and  thus  the 
extent  of  smoothing  of  U {t,  x)  for  a  given  gradient  size.  Many  other  techniques  have  been 
proposed  with  each  addressing  different  aspects  of  the  limitations  of  the  above  equation. 
Specifically,  one  which  addresses  the  stopping  criterion  problem  [50,  49]  may  be  written  for 
simplicity  in  a  1-D  evolution  as  a  Markov  chain  equation,  see  Eq.  (3.4.2). 

To  the  best  of  our  knowledge,  none  of  these  techniques  [89,  50]  resolves  the  problem  of 
texture  loss  alluded  to  in  the  introduction.  The  fact  that  a  first  order  difference  implemen¬ 
tation  of  a  gradient  in  the  selective  filtering  of  an  image  is  a  main  source  of  this  loss  may 
easily  be  observed  by  the  convergence  of  the  data  to  staircase  functions [49]  and  can  be  seen 
in  Fig.  5.1.  Our  goal  in  the  sequel  is  in  effect  to  lift  this  limitation  by  reinterpreting  the  first 
difference  implementation  as  a  Haar  wavelet  coefficient  and  by  subsequently  seeking  a  more 
regular  wavelet  implementation/approximation  as  we  elaborate  further  next. 

5.3  A  Multiscale  Approach  to  Scale-Space  Analysis 

Much  of  the  research  in  nonlinear  diffusion  [89]  has  been  carried  out  for  the  most  part  on 
a  parallel  track  to  and  with  little  interaction  with  all  that  had  been  pursued  in  wavelet  or 
multiresolution  analysis.  This  is  in  spite  of  the  fact  that  both  approaches  are  very  much 
based  on  the  notion  of  scale,  and  that  both  fully  use  information  gleaned  along  it.  Creative 
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A  noisy  signal  and  its  filter  result  with  our  random  walk  algorithm 


Figure  5.1:  A  profile  of  noisy  Lenna  image  and  filtered  result  with  Random  walk 
algorithm[52]. 

and  clever  ideas  germane  to  the  two  philosophies  resulted,  and  distinct  advantages  and 
limitations  emerged.  A  natural  question  which  then  arises  is  on  their  interplay  and  on  any 
potential  gain  which  may  result  if  the  synergy  is  exploited.  Towards  that  end,  we  hrst  recall 
some  facts  about  frames  and  their  role  in  signal  representations  and  subsequent  hltrations. 

5.4  Frame  Representation  and  Reconstruction 

As  is  well  known,  an  image  may  be  well  represented  in  a  Haar  wavelet  frame  by  obviating 
the  dyadic  down-sampling  step  (i.e.  a  redundant  representation)  of  the  usual  orthonormal 
representation.  While  the  latter  representation  is  usually  parsimonious  and  yields  a  perfect 
reconstruction,  it  exhibits  a  visually  noticeable  loss  of  information  in  the  course  of  additional 
transformations  such  as  coefficient  thresholding  for  denoising  or  compression,  etc.  This 
loss  is  particularly  evident  in  an  orthonormal  representation  with  a  short  support  analysis 
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wavelet,  and  may  be  attributed  to  the  fact  that  any  given  transformed  coefficient  has  a 
significant  local  influence  and  hence  impact  on  the  visual  outcome  .  Smoothing  by  coefficient 
thresholding  which  is  usually  popular  in  wavelet-based  denoising  [52,  50],  is  in  contrast  to 
the  softer  and  more  progressive  smoothing  commonly  encountered  in  nonlinear  diffusion  of 
scale  space  filtering  [75].  The  redundancy  of  this  continuous  scale  approach  in  some  sense, 
counterbalances  the  singular  effect  of  an  orthonormal  wavelet  coefficient. 

Our  goal  in  this  section  is  to  establish  for  a  given  signal  an  equivalence  between  smooth¬ 
ing  based  on  the  Heat  equation  and  that  based  on  a  diffusion-emulating  transformation  of 
its  frame  coefficients.  The  wealth  of  available  wavelet  functions  makes  it  possible  for  us  to 
optimize  our  desired  ability  to  capture  longer  term  (higher  than  Markov  of  order  1)  correla¬ 
tion  information  at  a  given  spatial  location  in  an  image  and  to  still  preserve  the  local  focus 
critical  to  exploiting  salient  features  in  nonlinear  filtering. 

To  proceed,  let  0(a;)  be  a  scaling  function  with  a  compact  support  such  that{(;/)(a;  —  n)}n£i, 


is  an  orthonormal  basis  of  Vq,  the  space  of  observations.  Its  Fourier  transform  is  4>{u!)  = 
-^h(|)0(|),  and  consequently  satisfies  the  multiresolution  analysis  framework  [66].  Denote 

by  ijj{x)  a  function  with  a  Fourier  transform  i.e.,  a  corresponding  and 

V2 

so-called  mother  wavelet.  As  is  well  known  [66],  we  can  write 


^<^(1)  =  '^h{n)(j){x-n);  ^^(f)  =  -  n), 


where  {h(n)}„gN  and  {5'(n)}„gN  with  Fourier  transforms  h{u),g{u)  satisfy  the  complemen¬ 
tarity  property  g{uj)  =  -|-  n). 

Define 


LAX)  - 


with  and  as  tight  frames  of  Vj  and  Wj  (the  so-called  approximation 

and  detail  subspaces).  It  follows  that  {(t)j,2n{.x)}  and  {0j,2n+i(a^)}  as  well  as  {t/’j.2n(a^)}  and 
{t/’j.2n+i(a^)}  are  respectively  orthonormal  bases  of  Vj  and  Wj.  With  the  selected  functions 
in  hand,  and  upon  obtaining  a  frame  representation  of  a  signal,  we  first  proceed  to  show 
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how  it  may  be  used  to  implement  a  linear  filter  equivalent  to  that  achieved  by  a  linear  heat 
equation. 

Proposition  3.  The  numerical  implementation  of  a  diffusion  effect  of  Laplacian  Af)  operating 
on  a  signal  U{x)  (as  in  the  linear  heat  equation),  may  be  achieved  by  an  iterative  subtraction 
of  the  highest  detail  (also  referred  to  as  detail  of  detail)  contribution  of  a  Haar  wavelet  packet 
frame  representation  of  the  signal. 


U{n  +  l,x) 


U{n,x)  -  -  1)) 

U  (n,  x)  +  a;  —  1) 


(5.4.1) 


Prior  to  proving  this  proposition,  we  have  to  establish  the  following  two  lemma. 
Lemma  1.  If  we  are  given  a  function  f{x)  together  with  its  frame  representation  coefficients, 


aj{n)  =<  f,  0,, „  >  and  dj{n)  =<  f,  > 
in  a  tight  frame  {n,j)  G  Z  the  following  formulae  yield 

+O0 

aj+i{p)  =  —  p)aj{n)  and 

—  OO 
+O0 

dj+fp)  =  ^g{n-  p)aj{n), 

—  OO 

with  either  of  the  following  two  reconstructions  for  aj(p) 

+  00  +00 

Hip)  =  H+ii‘^n)h{p  -2n)  +  Y^  d,+f2n)g{p  -  2n) 

—  OO  — OO 


or 


+  00  +00 

%(p)  ~  'Ll  ~  —  1)+^^  d„i{2n  +  l)g(p  —  2n  —  1) 


(5.4.2) 


(5.4.3) 


—  OO 


—  OO 


(5.4.4) 
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Proof:  The  proof  is  immediate  by  noting  that 


1 

71^ 


X  —  2^p 


2t+i 


u  —  p 


(fiu  —  n)du 


X  —  2^n 


dx 


=  h{n-p), 


(5.4.5) 


and  similarly  <  0j,2„  >=  g{n  -p).  ■ 

Recall  that  our  goal  is  to  establish  a  direct  relationship  between  a  linear  diffusion  hlter  and 
its  numerical  implementation  in  a  wavelet  domain.  The  space/time  invariance  property  of  the 
linear  diffusion  imposes  a  frame-based  representation  of  a  signal  being  analyzed.  The  choice 
of  a  wavelet  frame  representation  for  this  purpose  is  further  justihed  by  the  intrinsic  analytic 
property  of  wavelets  for  focusing  useful  energy  (  e.g.  of  the  desired  signal)  in  relatively  few 
coefficients  and  for  spreading  that  of  the  noise  over  many  coefficients.  This  consequently 
indicates  that  an  efficient  and  systematic  multiscale  representation  with  a  sufficient  and 
flexible  spectral  redundancy  may  result.  Towards  that  end,  we  start  by  stating  the  following, 


Lemma  2.  The  detail  spaee  Wj  can  he  expressed  as  a  direct  sum  of  two  subspaces  Wj  = 
Defining 


H-OO 

i^lp{x)  =  ^  h{n  -  p)f)^^„{x)  and 

—  OO 
+  00 

^9{n-p)i^jAx),  (5.4.6) 

—  OO 

{'f’fpix)}  and  {'f’fpix)}  are  respectively  frames  of  Vj^L,Wj^L  while 

({+“2n(^)})  rcspectwely  the  corresponding  or¬ 

thonormal  bases  of  and  1+,^ .  Furthermore  denoting  the  coefficients  of  the  decomposition 
of  the  details  in  the  frames  {'f’fpix)}  and  {'f’fpix)}  by  {da^^n)}  and  {ddj{n)},  we  have  per 
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Eq.(  5.4-4),  the  following  reconstruction  relationship 

+00  +00 

dj{n)  =  daj(2n)h{p  —  2n)  +  ddj{2n)g{p  —  2n) 

—  OO  —OO 

or 

H-oo  +00 

dj(n)  =  daj{2n  +  l)h{p  —  2n  —  1)  +  ddj{2n  +  l)g{p  —  2n  —  1). 

—  OO  — OO 

Proof:  Similar  to  the  proof  of  Lemma  1. 

Proof  of  Proposition  1:  See  Appendix  A.  ■ 

5.5  Selection  and  Impact  of  a  Wavelet  Support 

The  Fourier  transform  Uf{n,u)  may  be  written  in  a  Haar  basis  as 

Ut{n,u)  =  g*{u)U{n,u), 

while  that  of  Eq.  (5.4.1)  may  be  written  as, 

U{n  +  l,ci;)  =  (1  +  -g*  {uj)g*  {uj)e~^‘^)U{n,uj)  =  LF{uj)U{n,uj)  (5.5.1) 

where  LF{-)  is  a  low  pass  hlter  resulting  from  Eq.  (5.4.1).  The  transfer  function  correspond¬ 
ing  to  a  Haar  wavelet  g{u:)  with  its  conjugate  denoted  by  g*{uj)  is  given  by 

g{u)  =  V2sin(^)e^^ , 

hence  resulting  in 

LF{uj)  =  (1  -|-  cosuj)/2. 

As  may  be  suggested  by  Eqs.  (5.4. 1,5.5. 1)  as  well  our  earlier  comments,  we  may  select  a 
different  g{-)  (e.g.  a  higher  order  wavelet,  such  as  a  Daubchies-4  etc.)  and  investigate  the 
overall  behavior  of  LF{u)  as  illustrated  in  Eig.  5.2. 

The  choice  the  number  of  vanishing  moments  is  a  degree  of  freedom  which  may  be  opti¬ 
mized  around  specihc  applications  or  goals.  In  Fig.  5.2,  we  contrast  the  characteristics  of  our 
diffusion  hlter  using  a  Haar-based  implementation  of  the  Laplacian  operator  to  that  based 
on  a  higher  order  wavelet  such  as  Daubichies  wavelet  with  vanishing  moments  4(114).  The 
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*  *  *  -s-  *  *  Haar  basis 
- Daubechies  6  basis 


Figure  5.2:  Spectral  characteristics  of  Heat  equation-like  filter  using  Haar  and  Daubechies- 
4/6  wavelet  functions. 


CHAPTER  5.  WAVELET  ERAME-BASED  NONLINEAR  EILTERING 


60 


larger  support  wavelet  (-D4)  exhibits  a  more  graceful  but  nevertheless  sharp  roll-off  of  the 
lowpass  hlter.  This  is  a  direct  result  of  the  selected  support  and  hence  of  the  smoothness 
of  the  wavelet.  The  selection  of  higher  order  wavelets  may  be  justihed  as  a  solution  to  the 
blockyness  problem  pointed  out  in  Section  I  on  several  counts.  Recall  that  our  frame-based 
diffusion  amounted  to  a  progressive  reduction/elimination  of  specihc  frame  coefficients  (high 
details).  The  linearity  of  this  hltering  effectively  assumes  that  the  wavelet  (i.e.  orthogo¬ 
nal)  coefficients  are  uncorrelated  which  as  well  known,  is  inaccurate  and  leads  to  either  a 
signihcant  feature  distortion  or  other  undesired  artifacts  in  an  image  reconstruction.  This 
thus  highlights  the  importance  of  either  accounting  for  the  inter-coefficient  correlation  and 
avoiding  undesired  leakage  of  useful  information  in  incorrectly  deleted  coefficients  or  of  decor- 
relating  the  orthogonal  coefficients  which  may  be  achieved  to  a  large  extent  by  optimizing 
the  wavelet  support.  Increasing  an  analyzing  wavelet  support  (i.e.,  adopting  a  higher  or¬ 
der  wavelet  ijji{x))),  has  been  shown  by  Tewhk  and  Kim [90]  to  yield  increased  decorrelation 
among  the  wavelet  (orthogonal)  coefficients.  This  in  turn,  enhances  the  performance  of  tech¬ 
niques  such  as  wavelet  thresholding  [23,  51,  84]  and  diffusion.  In  concert  with  the  wavelet 
support  selection,  the  redundancy  of  a  frame  operator  which  yields  a  range  rank  reduction, 
affords  additional  noise  elimination  while  preserving  useful  features,  such  as  image  texture, 
as  is  the  goal  herein. 

An  alternative  and  perhaps  more  intuitively  appealing  justihcation  of  using  higher  order 
wavelets  than  Haar  (for  improved  hltered  reconstruction)  follows  upon  recalling  the  assumed 
observed  model  f{x)  =  s{x)  -|-  n{x)  which,  for  simplicity  is  assumed  to  be  1-D. 

Fact:  Increasing  the  support  of  an  analyzing  wavelet  in  the  above  frame-based  diffusion, 
slows  down  the  smoothing  of  smooth/polynomial  trends. 

The  underlying  signal  of  interest  s{x)  is  an  a.s.  continuous  function  with  discontinuities, 
which  may  always  be  represented  as 

i  i 

where  Sc{x)  =  -  and  s^{x)ilA^  respectively  are  the  continuous  and  the  discon¬ 

tinuous  parts  in  interval  A*,  such  that  X  =  IJj  Aj,  and  I  a.  is  an  indicator  function  on  the 
interval.  It  is  well  known  that  s^{x)  may  be  arbitrarily  well  approximated  by  a  polynomial 
Pi{x){see  Fig.(  5.3)  for  an  illustrative  example  with  a  continuous  function  over  the  partition 
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of  interval  and  isolated  discontinuities)  such  that 

I  ~  Pi e  for  a  small  enough  e  . 


On  any  interval  Ai  and  V  a:  G  Aj,  we  have 

f{x)  =  Si{x)  +  sf{x)  +  n{x) 

=  Pi{x)  +  Si{x)  -  Pi{x)  +  sf (x)  +  n{x) 

=  Pi{x)  +  Pi'^ix)  +  Si{x)  -  pi{x)  +  Si{x)  +  n{x) 

=  p^{x)  +  Wi{x),  (5.5.2) 


where  pf{x)  is  a  polynomial  of  order  k  and  p^^{x)  =  Pi{x)  —  Pi{x),  k  G  N,  and  ||  Wi{x)  ||<|| 
Pi'^{x)  +  e  +  sf  +  n{x)  II,  where  ||  ■  ||  is  the  L^  norm.  Given  an  analysis  wavelet  t/’j  (x)  with  k 
vanishing  moments  and  minimum  support,  its  application  to  the  signal  /(x),  results  in  the 
following  coefficients 


df 


f{x)ip{x  —  j)dx  =  d^*  +  dT\  and 


0, 


(5.5.3) 


where  ”f”  denotes  the  analysis  interval  Ai  and  ”j”  the  translation  parameter.  Recall  that 
the  progressive  hltering  of  f{x)  by  the  frame-based  linear  diffusion  entailed  the  smoothing 
of  the  details  of  Yl 

<  ff{x),'ip^{x)  >  i)^^{x)  =  '^dd{^'ip^{x)  =  wf{x) 
j  j 

Recall  that  most  of  the  noise  energy  in  fYx)^  i,  is  projected  onto  the  subspace  which 
includes  /'^'^(x)  and  the  latter  gets  systematically  smoothed  away  from  f{x)  according  to 
Eq.  ??eq:basicdiffusion),  V  x  E  A^ 

fkix)  =  f{x)  -  /"(x)  =  p'^ix)  +  wt{x),  (5.5.4) 


which  when  iterated  leads  to 


fkix)  =  fk  -  ft  =  pHx)  +  wl(x), 


(5.5.5) 
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Simulated  signal  fuction 


10 1 - ^ ^ ^ ^ ^ - 

5  -  - 

0 - 

-5  - 
-10  - 
-15  - 
-20  - 

-25  I - ^ ^ - 1 - 1 - 1 - 

0  50  100  150  200  250  300 

Continuous  part  of  the  signal  funotion 


Approximate  the  continuous  part  of  the  signal  function  with  polynormial  of  order  2 


Figure  5.3:  A  discontinuous  signal  and  its  decomposition  as  sum  of  continuous  part  and 
discontinuous  part,  also  approximation  of  the  continuous  part. 
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A  profile  from  clear,  noisy, filtered  rocks  picture  separately 


Figure  5.4:  A  profile  take  from  the  rock  texture  image,  with  hltered  result 

V  a;  G  Aj  at  step  n,  where  represents  the  iteratively  updated  information  of  Wi{x)  at 
step  n,  or  =  Wi{x)  —  wf^{x),  amounting  to  saying  that  wf^  (details  of  details),  approach 
0  as  n  — >  cx).  The  evaluation  of  estimation  error  may  be  given  as 

^11  II  (5.5.6) 

which  is  seen  to  decrease  as  k'  >  k,  as  ||  s{x)  —p^{x)  ||>||  s(a;)  —p^'{x)  ||,  where  p^{x)  is  the 
integrated  contributions  of  all  polynomials  of  order  k  over  all  intervals  in  the  partition.  This 
hence  clearly  shows  that  if  dj"  =  0,  i  =  1,  •  ■  ■  A^,  as  would  be  the  case  for  a  proper  choice 
of  vanishing  moments  of  the  analyzing  wavelet,  the  continuous  signal  contribution  to  Wi{x) 
would  vanish  and  equivalently  the  preservation  of  all  the  continuous  trends  (polynomials) 
are  projected  onto  the  approximation  subspace  as  demonstrated  in  Fig.(  5.4). 

5.6  Image  Reconstruction  using  a  Haar  Frame 

To  further  investigate  the  interplay  between  PDF-based  hltering  and  multiscale  analysis,  we 
proceed  to  specialize  the  foregoing  development  to  a  Haar  wavelet  frame  and  subsequently 
derive  an  equivalent  diffusion  transformation  similar  to  that  of  a  Heat  equation.  For  clarity 
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of  notation  as  well  algebraic  expediency,  we  adopt  a  matrix  formalism  which  is  convenient 

for  and  compatible  with  an  image  representation  as  a  matrix.  It  is  also  readily  extended  to 

any  wavelet  function  which  may  be  selected  for  the  application  at  hand.  It  is  well  known 

that  a  nonorthogonal  Haar  representation  of  a  signal  may  still  yield  a  reconstruction.  To 

demonstrate  such  a  procedure,  denote  the  impulse  response  of  hlters  corresponding  to  a  Haar 

1 

71’ 

next  construct  a  N  x  N  circulant  matrix  from  a  vector  [a(l),  a(2),  •  •  ■  ,  a(m),  0,  •  ■  ■  ,  Ojixw  as 
Cfr[a(l),  •  •  •  ,  a{m)]p^xN,  and  also  write  as  a  matrix  circularly  shifted  by  k  columns,  i.e. 

'O---  0  1  ■■■  o' 

0  ■■■  0  0  ■■■  0 

_  0  0  ■■■  0  ■■■  1 

R,n  ~ 

1  . 

0  . 

_0---  1  0  ■■■  0_ 

Denote  the  following  circulant  matrices, 

H  =  Cir[h{R),  h(l)]ArxAr  and  G  =  Cir[g{0),  g{l)]NxN 

Property  1.  Let  a  matrix  Aq  denote  an  initial  image.  Its  redundant  representation  using  a 
separable  Haar  function  (i.e.,  obtaining  the  following  spectral  decomposition  Low-Low, Low- 
High,  High-Low,  High- High)  can  be  written  as 

Hi  =  HAqH'-,  Di  =  HAqG'] 

D2  =  GAoH'-  D3  =  GAoG', 

where  “/  ”  denotes  transposition.  The  reconstruction  matrices  can  similarly  be  written  as 

R\  =  h{D)I-  R{  =  g{D)R 
R^  =  Rl  =  g{l)R^ 

In  light  of  the  fact  that  a  redundant  representation  is  given  or  may  be  computed,  the 
exact  reconstruction  methods  have  to  be  carefully  rewritten.  Towards  that  end  we  have  the 
following: 


wavelet  analysis  by  h  =  [h(0),h(l)]  =  [ 
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Property  2.  Denoting  the  partial  reconstruetion  matrices  by 

RA^  =  R^AiRf;RDY  =  R-DiRf]  (5.6.1) 

RD^^  =RfD2Rf-,RDl^  =R!DsRf,  *,j  =  l,2. 

we  may  use  any  of  the  following  four  methods  to  exactly  reconstruct  the  original  image  Aq., 

methodl:  Aq  =  RA]^  +  RD\^  +  RDl^  +  RD]^ 

method2:  Aq  =  RA^  +  RD^^^  +  RD^  +  RD^^ 

methods:  Aq  =  RA^  +  RD^^  +  RDl^  +  RD^^ 

methodf:  Aq  =  RA^  +  RD^^  +  RD^  +  RD^^. 

5.7  Smoothing  in  the  Frame  Domain 

The  above  decomposition  and  reconstruction  procedures  follow  similar  steps  for  other  higher 
order  wavelets  such  as  Daubechies’.  Accounting  for  the  impulse  response  of  corresponding 
filters  leads  to  a  a  slight  modihcation  reflected  in  the  matrices  H  and  G  which  can  be  written 
as 

H  =  Cir[h(0),h(l),h(2),h(3)]Arxv, 

G  =  h,t,*Gir[g{-2),g{-l),g{t)),g{l)]NxN 

Ri  =  h, N  *  Gir[h{i  +  1),0,  h{i  -  l)]NxN,i  = 

Rf  =  Gir[g{i  -  1),0,  g{i  -  3)]NxN,i  =  I, ‘2. 

Following  the  same  strategy  for  Daubechies’  wavelets  as  above,  a  reconstruction  in  a  frame 
may  be  obtained,  and  any  of  the  following  representations  may  be  used 

Ao  =  R^.A^R’l' +  R'lD^Rf  +  RfD^R^,' +  RfDsRf, 

Ao  =  , 

Ao  =  R’IAiR!^' +  R^DiR^' +  RID2R2  +  RfD^R^^' , 

Ao  =  R^2^iR^2  +  R2DiR^2  +  RID2R2  +  RlD^R^ 
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We  next  denote  the  detail  matrix  coefficients  at  the  first  level  by  Di,i  =  1,  2,  3,4  and  at  the 
second  level  by  W-,j  =  1,2,  3, 4.  Armed  with  methods  1-4  to  reconstruct  Di,  D2,  D^,  D4, 
and  using  the  knowledge  that  noise  primarily  dominates  higher  spectral  bands,  we  proceed 
to  effect  the  smoothing  similar  to  that  of  a  Haar  frame-based  linear  diffusion  (i.e.,  progressive 
elimination  of  detail  of  detail  information  from  Aq)  to  result  in  the  following  recursion, 

+  +  (5.7.1) 

Note  that  this  recursion  will  also  achieve  a  linear  diffusion  as  stated  in  Proposition  1,  albeit 
with  modified  intermediate  characteristics.  The  complete  smoothing  witnessed  with  the 
linear  Heat  equation  will  still  be  the  ultimate  fate  of  the  signal  being  filtered.  A  technique 
to  slow  down  such  an  event  is  described  next. 

5.8  Nonlinear  Reconstruction 

Inspired  by  the  algorithms  of  the  first  section  such  as  that  of  Perona-Malik’s  or  that  proposed 
in  [52]  and  to  better  address  the  preservation  of  features,  such  as  texture  which,  however  and 
as  just  shown,  is  eventually  swept  away  by  a  linear  diffusion.  These  features  as  noted  above, 
are  well  captured  by  the  correlation  among  the  coefficients,  which  by  using  the  insight  of 
Section  3,  help  us  proceed  to  construct  a  frame-based  nonlinear  reconstruction  filter.  The 
flexibility  in  properly  selecting  a  wavelet  function  adapted  to  the  texture  of  interest,  together 
with  the  rationale  of  preserving  large  magnitude  coefficients  which  best  summarize  the  un¬ 
derlying  information  while  reducing/eliminating  the  contribution  of  others  as  suggested  by 
Eq.  (5.7.1),  lead  us  to  propose  a  transformation  of  the  individual  coefficients  as 

D,  =  Di*Af{{Dj}).  (5.8.1) 

The  generally  nonlinear  functional  may  take  a  monotonic  form  similar  to  that  proposed  by 
P-M,  where  the  decay  rate  is  selected  based  on  some  prior  knowledge  we  may  have  about 
the  underlying  image. 

For  illustrative  purposes,  we  choose  Af{y)  =  and  hasten  to  point  out  that  other 
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K=2 

K=1 

K=0.5 


Figure  5.5:  One  possible  nonlinear  functional  is  an  exponential  weighting. 
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functionals  adapted  to  other  specific  applications  are  currently  under  investigation.  The  set 
of  coefficients  which  are  subjected  to  the  transformation  are, 

{Di  =  Di*exp{-Df/2K); 

D2  =  D2*exp{-Dl/2K)- 
Ds  =  D,*expi-Dl/2K), 

and  their  insertion  in  the  above  recursive  reconstruction  yields  a  nonlinear  hlter. 

5.9  Experimental  Results 

The  absence  in  our  illustrations  of  blocky  artifacts  or  Gibbs  phenomena  so  common  with 
many  mnltiscale  techniqnes  (wavelet  thresholding)  and  also  robust  scale  space  techniques 
(e.g.  [52]),  not  only  demonstrates  the  effectiveness  of  the  proposed  approach,  bnt  also  points 
to  the  importance  of  the  synergy  that  may  be  gleaned  from  multiscale  analysis  and  scale 
space  methods.  The  performance  of  our  proposed  nonlinear  hlter,  is  readily  assessed  in  the 
henna  pictnre  shown  for  three  different  denoising  techniqnes,  namely,  onr  originally  pro¬ 
posed  techniqne[49],  Perona-Malik’s,  and  the  newly  proposed  technique.  The  ability  of  the 
proposed  technique  to  remove  noise  while  preserving  featnres  like  textnre  is  readily  apparent 
in  Fignres  5.6-  5.7  and  the  importance  of  snch  techniques  in  many  applications  needs  no 
fnrther  elaboration. 
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Noisy  lenna  image 


Fiitered  by  our  random  waik  aigorithm 


Fiitered  by  PM  aigorithm 


Fiitered  by  our  waveiet  aigorithm 


Figure  5.6:  A  noisy  Lenna  image  and  filtered  result  with  three  algorithms. 
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Figure  5.7:  A  texture  image,  noisy  texture  image  and  filtered  result  with  Daubechies  4 
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Original  fabric  image 


Noisy  fabric  image 


Filtered  fabric  image 


Figure  5.8:  A  texture  image,  noisy  fabric  image  and  filtered  result  with  Daubechies  4 
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Clear  picture  of  rocks 


Noisy  picture  of  rocks 


Filtered  picture  of  rocks 


Figure  5.9:  A  texture  with  rocks  image,  its  noisy  image  and  filtered  result  with  Daubechies 
4 
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5.10  Appendix 


Proof  of  Proposition  3.1:  Having  established  the  above  two  lemmas  and  specializing  the 
results  to  a  Haar  function, 


(t){x)  = 


1  0  <  a;  <  1 
0  others, 


{-1  0<a;<0.5 
1  0.5  <  a;  <  1 

0  otherwise, 

for  which  the  wavelet  hlter  impulse  response  is  5f(0)  =  (/(I) 

yields  the  following  detail  coefficient  of  U (n,  a;) 


^ ,  and  which  readily 


W^in.x) 


1 
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U  (n,  a;)  + 


1 
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U{n,x  +  1). 


Towards  establishing  the  iterative  implementation  of  the  linear  diffusion,  we  hrst  invoke 
Eq.(  5.4.3)  of  Lemma  1,  to  write 
From  Lemma  1  and  we  have 


U (n,  2x)  =  h{0)  ■  U°'{n,  2x)  +  g{0)  ■  2x).  (5.10.1) 

To  reconstruct  IT^ln,  x)  from  its  Haar  wavelet  decomposition  coefficients  x),  x), 

we  use  Eq.(  5.4.4)  to  obtain 

U'^{n,2x)  =  h{l)-U'^^{n,2x  -1)  +  g{l)-U^'^{n,2x  -  1).  (5.10.2) 

Combining  Eq.(  5.10.1)  and  Eq.(  5.10.2)  results  in  the  following 
U {n,  2x) 

=  h{0)-lT^{n,2x)+g{0)-{h{l)  ■17“'^(n,  2a;  —  l)+5f(l)  ■7/‘^‘^(n,  2a;  —  1)) 
=h{0)-U‘^{n,2x)+g{0)-h{l)  ■H“'='(n,  2a;  -  1)  +g{0)-g{l)-U^^{n,2x  -  1).  (5.10.3) 

We  may  similarly  obtain 

U{n,2x  +  1) 

=  /i(0)-H“(n,  2a;  +  l)+^(0)-/i(l)-H“'^(n,  2x)+g{R)g{l)-U^^{n,  2a;),  (5.10.4) 

implying  the  following 
U  (n,  a;) 

=  /i(0)-17“(n,  a;)+5f(0)-/i(l)  ■  U^‘^{n,x  —  l)+g{0)g{l)-U'^'^{n,  x  —  1).  (5.10.5) 
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Note  that  the  second  level  detail  component  g{0)g{l)LT^‘^{n,  a;  —  1)  is  equal  to  —\LT^'^{n,  a;  —  1), 
and  is  subtracted  from  the  reconstruction  of  U{n,x)  to  yield  the  updated  U{n  +  l,a;).  The 
following  formula,  equivalent  to  a  Heat  diffusion  equation,  is  hence  established. 


U{n+  l,a;) 


U{n,x)  -  {-^U'^‘^{n,x  -  1)) 
U  (n,  x)  +  a;  —  1) 


(5.10.6) 


Chapter  6 


Independent  Component  Analysis 


Independent  component  analysis(ICA),  a  data  analysis  concept  that  was  first  introduced  by 
C.Jutten  and  J.Herault  [47]  and  also  known  as  blind  source  separation(BSS)  in  applications, 
has  been  of  intense  research  interest  in  a  number  of  application  helds  (  for  instance,  speech 
recognition,  remote  sensing  and  biomedical  imaging).  Much  has  been  accomplished  [16,  14, 
13,  42,  7,  76,  2]  including  the  ICA  demonstrated  potential  in  signal/image  enhancement.  It 
may  be  viewed  as  a  natural  extension  to  standard  principal  component  analysis(PCA),  which, 
as  is  well  known,  is  based  on  the  correlation  structure  of  observed  data.  In  this  chapter,  we 
investigate  novel  and  efficient  ways  of  carrying  out  an  ICA  using  novel  information  theoretic 
criteria. 

6.1  Linear  ICA  models 

Consider  a  set  of  observed  signals  from  multiple  sensors,  each  sensor  receiving  a  different 
combination  of  the  source  signals,  which,  for  simplicity,  are  assumed  to  be  random  variables, 
since  if  viewed  as  sample  paths  of  a  random  process,  more  complex  models  are  required. 
Hence,  representing  data  as  random  vectors,  as  we  elaborate,  facilitates  the  use  of  statistical 
methods  such  as  entropy,  correlation  and  measurement  of  redundancy,  and  turn  out  to  be 
a  powerful  model.  Although  a  general  data  model,  such  as,  x  =  f{s)  is  desirable,  here 
X  =  (Xi,  ■  ■  ■  ,  Xm)  are  outputs,  s  =  {Si,  ■  ■  ■  ,  Sn)  is  a  source  random  vector  and  /  is  a 
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transform  function,  most  applications  assume  a  linear  transform  in  the  form  of  a;  =  As.  To 
proceed  with  describing  a  linear  ICA  model,  we  will  assume  that 

•  The  source  signals  are  independent  random  variables 

•  The  distributions  of  the  source  signals  are  unknown 

The  task  in  using  of  independent  component  analysis(ICA)  or  Blind  source  separation(BSS) 
is  to  recover  independent  source  signals  from  mixed  observations  under  these  assumptions. 

The  linear  BBS  or  ICA  model  assumes  the  existence  of  n  independent  source  signals 
s{k)  =  (S'i(/c),  S2{k),  ■  ■  ■  ,  Sn{k)),  and  x{k)  =  (Xi(/c),  X2{k),  •  •  ■  ,  X^(/c)),  /c  =  1,  •  •  •  ,  77 
are  observed  signal  samples  from  m  sensors,  which  are  linear  mixtures  of  source  signals  in 
the  presence  of  additive  noise  n{k)  =  (A^i(/c),  N2{k),  ■  ■  ■  ,  N^ik)),  and  more  simply, 

x{k)  =  As{k)  +  nik) 

where  A  is  an  unknown  full  column  rank  mxn  matrix  that  accounts  for  the  linear  mixtures 
of  the  signals,  with  K  observed  discrete  samples.  In  this  model,  s{k)  and  x{k)  could  be 
complex  signals  and  m  >  n  is  usually  imposed  (  some  special  m  <  n  cases  have  been  studied 
in  [17,  59,  10]  ),  however,  we  only  consider  the  simplest  model,  namely,  real- valued  signals, 
m  =  n,  and  noise  free  observations,  i.e., 

x{k)  =  As{k), 

since  this  model  captures  the  essence  of  the  ICA  or  BSS  problem,  where  noise  is  usually 
considered  as  a  nuance  parameter  and  is  largely  neglected  in  the  literature.  An  example  of 
mixed  signals  obtained  from  independent  sources  subjected  to  rotation  is  displayed  in  Fig. 
(6.1). 

To  recover  signals  from  mixed  data  is  to  estimate  a  full  rank  matrix  W  such  that  the 
estimated  signal  components  of  y{k)  =  (Yi{k),  •  ■  ■  ,  Yn{k))  are  as  independent  as  possible 
and  the  outcome  sources  space  (  may  be  permuted  and  scaled  )  are  close  to  source  signals, 
namely 

y{k)  =  Wx{k)  =  WAs{k)  =  Bs{k). 
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Independent  source  signals 


2 1 - 1 - 1 - 1 - 1 — 

1  - 1  I - 1  I - 1  I - 1  I - 1  I — 

0  - 

-1  -  I - 1  I - 1  I - 1  I - 1  I - 

_2l - 1 - 1 - 1 - 1 — 

0  500  1 000  1 500  2000 


Mixed  signals  obtained  by  linear  rotation 


Figure  6.1:  Independent  source  signals  and  mixed  signals  obtained  by  rotation. 
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We  assume  that  at  most  one  of  the  source  signals  Si{k)  is  allowed  to  have  a  Gaussian 
distribution.  This  follows  from  the  fact  that  it  is  impossible  to  separate  several  Gaussian 
sources  from  each  other  [16]. 

6.2  Existing  ICA  Algorithms 

IGA  algorithms  are  closely  related  to  principal  component  analysis  [46]  ,  factor  analysis  [36] , 
and  projection  pursuit  algorithms  [30,  41].  In  many  IGA  algorithms,  it  is  required  that 
mixture  data  be  normalized  by  its  variance  and  be  pre-whitened,  which  means  that  we  can 
always  assume  a  unity  variance  for  each  component  (we  also  assnme  0  mean-valued  data, 
as  in  practice,  we  can  always  subtract  the  mean  valne  from  the  original  data  )  and  that 
there  exists  a  whitening  transform  matrix  V  snch  that  the  whitened  data  U  =  VX  has  a 
correlation  matrix  R  =  E{UU'^)  =  I,  where  I  is  an  identity  matrix. 

Farther,  by  assnming  that  at  most  one  of  the  components  is  Gaussian  distribnted,  and 
a  fnll  rank  matrix  A,  which  together  with  the  fact  that  independence  is  not  affected  by 
different  ordering  and  scaling  of  the  components,  a  nnique  solntion  to  IGA  is  assured  in  the 
sense  that  permntation  and  scaling  are  allowed  for  each  component. 

Upon  observing  the  mixed  signals  x  =  (Xi,  ■■■  ,  X„),  many  algorithms  seeking  to 
estimate  the  matrix  W  have  been  proposed  on  the  basis  of  information  theoretic  as  well 
as  neural  network-based  criteria.  These  algorithms  consist  of  optimizing  proposed  objective 
functions  known  as  contrast  fnnctions,  or  cost  fnnctions  [16].  Valid  contrast  functions  must 
be  designed  in  such  a  way  that  the  source  separation  is  achieved  when  they  reach  their  optimal 
(  minimnm  or  maximnm  )  values.  Gommon  contrast  fnnctions  are  based  on  measnres  such 
as,  entropy,  high-order  cnmnlants,  divergence  among  the  joint  probability  density  fnnctions 
of  observed  data  for  independent  models. 

The  hrst  nenral  source  separation  algorithm  was  presented  in  [47] ,  and  has  used  the  prin¬ 
ciple  of  cancelling  non-linear  cross-correlations  of  the  form  E{gi{Yi)g2{Yj)}  (where  gi,  g2  are 
some  snitably  chosen  odd  non-linear  functions,  and  V,  Yj  are  estimations  of  source  indepen¬ 
dent  signals  )  to  achieve  independent  components,  which  in  tnrn  implies  that  E{gi{Yi)g2{Yj)}  = 
E{gi{Yi)}E{g2{Yj)}  when  V,  Yj  are  independent.  Fnrther,  a  nonlinear  objective  fnnction 
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based  on  a  generalization  of  PCA  is  proposed  by  introducing  a  nonlinear  function  g{x)  to 
seek  out  principal  components  by  (  see  [44,  70]  ), 

it>i=arg^max  E{{g{w'^x)Y}.  (6.2.1) 

Inspired  by  neural  networks,  a  contrast  function  was  derived  in  [7]  on  the  basis  of  maxi¬ 
mizing  the  Shannon  entropy  of  non-linear  outputs  from  a  neural  network.  Specihcally,  one 
is  to  maximize  the  following  formula 

L  =  H{gi{wlx),  ■  ■  ■  , gmiwl^x)),  (6.2.2) 

where  x  is  the  input  to  a  neural  network  whose  outputs  are  of  the  form  gi{wjx)  and  Wi  are 
the  weight  vectors  of  the  neurons.  Here  Shannon  differential  entropy  of  a  continuous  random 
variable  (or  a  random  vector)  X  with  a  probability  density  function  (pdf)  f{x)  is  dehned  as 

H{X)  =  -  [  f  {x)  log  f{x)dx.  (6.2.3) 

Jn 

In  [16],  it  was  proposed  to  use  mutual  information  as  a  contrast  function  to  more  fully  de¬ 
scribe  the  statistical  independence  property.  In  a  probabilistic/information  theoretic  setting, 
researchers  have  held  the  mutual  information  measure  as  one  of  choice  in  identifying  a  proper 
representation  basis  of  random  samples  of  signals/sources,  and  for  which  the  resulting  coeffi¬ 
cients  are  independent.  This  is  an  interesting  property  of  source  independence  as  it  does  not 
include  any  implicit  or  explicit  assumption  about  the  distributions  of  the  sources.  Mutual 
information  for  a  set  of  random  variables  (hi,  ■  ■  ■  ,  Yn)  may  be  written  as 

n 

I{Y,,  ■■■  ,  F,,)  =  5^77(17) -i7(Fi,  ,  FO  (6.2.4) 

i=l 

Using  the  maximum  likelihood  principle  and,  if  we  assume  a  source  vector  s  is  distributed 
according  to  a  pdf  q{s),  a  contrast  function  is  formulated  as  the  log-likelihood  function 
of  a  pdf  of  the  estimated  output  y  =  A~^x,  for  T  samples  a;(l),---  ,x{T),  and  output 
y{l),  •  ■  ■  ,  y{T),  to  take  the  form  [13,  18]: 

1  ^ 

0ml(2/)  =^ogp{y)  =  -'^log q{A-^x{k))  -  \og{det{A)) . 

k=l 


(6.2.5) 
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By  the  law  of  large  numbers,  when  T  — oo,  the  log-likelihood  function  converges  to  the 
expectation  of  logg(A~^a;)  -|-  constant,  denoted  as  0ml(2/)  =  —E{\ogq{A~^x)}.  This  may 
also  be  motivated  by  the  Kullback-Leibler  divergence  K{f\g),  which  can  be  viewed  as  a 
distance  between  two  probability  density  functions  /  and  g  (  though  it  is  not  a  real  distance 
measure  because  it  is  not  symmetric)  and  is  also  written  as  K{X\Y)  where  /  and  g  are  pdfs 
of  two  random  variables  X  and  Y .  The  dehnition  of  Kullback-Leibler  divergence  [53]  is  the 
following 

(II)  ..  ,6.2.0 

This  in  turn  may  be  used  to  establish  that  the  log-likelihood  function  (pMciv)  ^lay  be  written 
as 

(j)MLiy)  =  K{y\s).  (6.2.7) 

The  Maximum  Likelihood  principle  thus  attempts  to  find  a  matrix  A  such  that  the  distribu¬ 
tion  oi  y  =  A~^x  is  as  close  as  possible  to  the  hypothesized  joint  distribution  of  the  sources. 
Denote  ^  as  a  random  vector  with  independent  entries,  and  each  entry  distributed  as  the 
corresponding  entry  of  y  {  y  is  thus  called  the  factorial  distributed  random  variable  of  K), 
we  see  that 

K{y\s)  =  K{y\y)  +  K{y\s),  (6.2.8) 

for  any  vector  with  independent  entries.  Eq.  (6.2.8)  shows  that  the  minimum  value  of 
K{y\s)  is  reached  by  minimizing  both  terms, 

1.  the  hrst  right  term  of  Eq.  (6.2.8),  which  is,  in  fact,  mutual  information,  therefore 
required  independence  among  each  entry  of  y 

2.  the  second  right  term  of  Eq.  (6.2.8),  which  requires  that  the  individual  entries  of  y 
have  the  same  distribution  as  those  of  s. 

we  can  thus  see  that,  if  the  source  distributions  are  known,  (pML  is  more  accurate  be¬ 
cause  it  expresses  directly  the  htness  between  data  and  model.  The  initial  distributions  of 
the  source  signals  are  however,  usually  unknown,  which  reduces  the  maximum  likelihood 
algorithm  to  that  of  maximizing  mutual  information  by  only  considering  an  independence 
property,  which  of  course  remains  an  important  feature  of  the  sources. 


K(f\3)=  [ /Wlog 
Js 
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It  is  also  proved  in  [13]  that  the  principle  of  neural  network  entropy  maximization,  namely, 
the  infomax  principle,  is  equivalent  to  maximum  likelihood  estimation  by  properly  choosing 
a  function  g  =  {gi,  •  •  ■  ,  gm)  in  Eq-  (6.2.2)  as  a  cumulative  distribution  function  (  cdf  )  of 
the  source  variables,  and  if  known. 

While  the  above  measures  are  sound  and  theoretically  appealing,  their  big  drawback  is 
in  having  to  estimate  the  pdf’s  or  the  entropy  which  is  not  always  trivial.  Several  approx¬ 
imations  to  MI  based  on  polynomial  Taylor  density  expansions  have  been  proposed  and 
yielding  contrast  functions  based  on  higher  order  cumulants  [16,  12].  Cumulants  of  order  2 
and  4  have  been  predominantly  used  and  yielding  for  example  the  following  approximations 
ol  y  =  {Yi,  •  ■  •  ,  Yn)  of  the  form  [16], 

1 

I{y)  ~  C'  +  -  +  7h{Yi)^  -  6h{Y^^h{Yi)]  (6.2.9) 

i=l 

where  ki{X)  =  E{X^),  i  =  1,2,  ■  ■  ■  ,  are  i  —  th  order  cumulants  of  a  random  variable  X  and 
C  is  a  constant  [16].  The  approximation,  however,  is  valid  only  when  the  density  function  of 
Y  is  close  to  the  Gaussian  probability  density  function,  otherwise,  poor  estimate  may  result. 

Higher-order  cumulant  tensors  have  also  been  directly  used  as  criteria  by  taking  advantage 
of  the  prevailing  algebraic  structure  [9,  10,  11],  a  good  review  may  be  found  in  [12].  This 
method  consists  of  looking  for  the  eigenvectors  of  a  higher  order  cumulant  tensor.  The 
fourth-order  cumulant  tensor  can  be  dehned  as  the  following  linear  operator  T  from  the 
space  of  m  X  m  matrices  to  the  space  of  rrR  x  matrices  with  the  i,j  element  of  T  as: 

T{K)^^  =  J2cum{X„  Xj,  Xk,  Xi)Ku  (6.2.10) 

k,l 

where  cum{Xi,  Xj,  X^,  Xi)  denotes  the  fourth-order  cumulant  and  the  subscript  ij  means 
the  (i,  j)  —  th  element  of  a  matrix,  and  Kki  is  amx  m  matrix.  This  linear  operator  has 
eigenvalues.  Solving  for  the  eigenvectors  of  this  eigenmatrix  would  lead  to  an  estimation  of 
the  ICA  model.  The  advantage  of  this  method  is  that  it  requires  no  knowledge  of  the  dis¬ 
tribution  of  the  independent  source  components,  with  the  understanding  that  the  efficiency 
issue  constraints  it  to  small  dimensions. 

With  emergence  of  approximation  techniques  of  differential  entropy,  more  sophisticated 


CHAPTER  6.  INDEPENDENT  COMPONENT  ANALYSIS 


82 


approximations  of  mutual  information  may  be  constructed  and  applied  to  ICA  [43,  22],  Non- 
parametric  estimation  of  mutual  information  [19]  based  on  dependent  data  also  provides  a 
useful  technique  to  directly  implement  ICA  algorithms  and  further  motivating  the  investiga¬ 
tion  of  alternative  measurements,  such  as  Jensen-Renyi  divergence  [37,  1]  and  Renyi  mutual 
divergence  as  criteria  to  ICA  [5].  The  technique  developed  to  approximate  a—  Renyi  entropy 
and  Renyi  divergence  are  also  described  in  [5,  1,  38,  39]. 

Practical  consideration,  such  as  an  unreliable  faulty  source,  or  that  the  underlying  sources 
may  in  fact  be  Gaussian,  may  lead  one  to  favor  a  technique  which  would  avoid  over¬ 
assumptions  about  the  prevailing  statistics  over  another  with  selectable  data  ranges  (e.g. 
ignore  large  outliers)  and  with  varying  degrees  of  weighted  contributions  instead  of  the  com¬ 
monly  used  uniform  equal  weighting  as  is  typical  of  many  existing  measures,  such  as  weighted 
covariance  matrix. 

Miscellaneous  alter  existing  approaches  are  deferred  to  the  literature  [76],  [2]  and  refer¬ 
ences  therein  as  we  have  instead  focused  on  techniques  relevant  to  our  later  discussing. 

6.3  Applications  of  ICA 

ICA,  or  BBS  techniques  have  been  applied  in  any  helds  where  an  array  of  m  receivers  collect 
data  of  linear  mixtures  of  n  source  signals.  Examples  include  speech  separation  (  known  as 
’cocktail  party  problem’  )  as  several  microphones  are  placed  in  different  points  while  there 
are  several  speakers.  It  may  be  also  applied  in  processing  arrays  of  radar  or  sonar  signals 
and  processing  of  multi-sensor  biomedical  recording  signals,  such  as  EEC,  MEG  signals  used 
to  record  brain  activities.  Medical  imaging  such  as  fMRI,  processing  of  geophysical  data  and 
restoration  of  image  features  are  also  other  typical  applications. 


Chapter  7 


New  measure  criteria  for  ICA 


As  we  mentioned  earlier,  mutual  information(MI),  as  a  measure  between  two  probability 
densities  has  been  intensively  used  by  many  authors  as  a  contrast  function  for  ICA.  Its 
estimation  is  complicated  by  having  to  use  empirical  density  functions,  which,  results  in  a 
weakness  to  estimate  entropies.  Research  has  mainly  focuses  on  finding  higher-order  ap¬ 
proximations  of  mutual  information  or  different  techniques.  This  criterion,  namely,  mutual 
information,  however,  has  recently  been  fallen  out  of  favor  to  other  information  measures, 
such  as  a— Jensen- Renyi  (  a— JR  )  divergence  [37].  This  divergence  measure  provides  a 
distance  among  a  group  of  probability  densities,  and  thus  can  serve  as  an  alternative  and 
improved  criterion  to  ICA.  Also  with  the  increasing  interest  in  simplified  and  robust  estima¬ 
tions  of  mutual  information,  some  recent  results  [1,  19]  are  signihcant  enough  to  allow  us  to 
reasonably  consider  applying  them  in  ICA  investigations.  In  this  chapter,  we  propose  Renyi 
mutual  divergence  as  a  new  criterion  on  account  of  its  additional  features  over  mutual  infor¬ 
mation.  We  also  propose  a  technique  to  approximate  Renyi  mutual  divergence  by  analyzing 
dependent  data,  and  discuss  in  detail  its  properties  later  in  this  chapter. 

7.1  Definition  of  Renyi  Entropy 

To  introduce  two  divergence  measures  based  on  the  Renyi  entropy,  an  information  measure 
that  was  first  introduced  by  Renyi  in  [81,  82],  which  has  been  shown  to  be  theoretically  as 
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well  as  practically  useful,  we  provide  a  brief  overview  on  a— Renyi  entropy. 

For  a  discrete  random  variable  X  with  sample  space  whose  corresponding  probability 
distribution  P(xi)  =  P(X  =  Xi),  Xi  G  12,  i  G  I,  where  /  is  an  index  set,  and  for  a  G 
(0,  2],  a  ^  1,  the  a— Renyi  Entropy  is  dehned  as 

//«(P(a;))  =  ^  In  j  ,  (7.1.1) 

with  ”ln”  denoting  the  natural  logarithm. 

For  a  continuous  random  variable  X  with  a  probability  density  function  (pdf)  p(x),  x  G 
7Z^,  a— Renyi  entropy  is  dehned  as 

H^{p{x))  =  — In  p°‘{x)dx^  ,  (7-1-2) 

where  a  G  (0,2],  a  ^  1. 

We  can  see  that,  when  a  ^  1,  a— Renyi  entropy  degenerates  to  the  Shannon  entropy 
[85],  which  is  dehned  as 

H{p{x))  =  —  j  p{x)\np{x)dx,  (7.1.3) 

for  a  continuous  random  variable  and 

H{P{x))  =  -5^P(a;,)lnP(a;i),  (7.1.4) 

i&I 

for  a  discrete  random  variable. 

The  advantage  of  Renyi  entropy  is  that  the  probability  (or  the  probability  density)  is 
modulated  by  a  factor  a,  which  along  with  its  simpler  form  make  Renyi  entropy  easier 
to  implement,  and  with  better  adapted  statistical  properties  to  a  random  variable.  Renyi 
entropy  is  concave  for  all  a  G  (0,2]  and  Fig.  (7.1)  demonstrates  this  property  for  a  specihc 
example  with  Bernoulli  distributions  for  diherent  a  G  (0,2]. 

7.2  Jensen-Renyi  divergence  as  a  new  criterion  for  ICA 

7.2.1  Introduction  to  a— Jensen-Renyi  divergence 

a— Jensen-Renyi  (a— JR)  divergence  is  a  new  concept  recently  proposed  in  [37],  and  may 
be  viewed  as  a  generalization  of  Renyi  entropy  [81,  82]  and  of  Jensen-information  [61],  see 
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Figure  7.1;  Renyi  entropy  of  Bernoulli  distributions  at  several  a  compared  to  Shannon 
entropy 

[1]  for  a  thorough  investigation  of  Renyi  entropy  within  the  context  of  Minimum  Spanning 
Trees(MST).  The  a— JR  divergence  as  an  information  measure,  invokes  a  weighted  combina¬ 
tion  of  several  distributions  whose  contribution  is  modulated  by  way  of  an  a-exponentiation 
parameter  (Viz.  Eq.  7.2.1). 

The  generalized  nature  of  this  measure  has  shown  very  promising  results  in  applica¬ 
tions  [37]  and  exhibits  a  wide  scope  of  applicability  tied  to  one’s  ability  to  reinterpret  the 
population  densities  and  to  hence  reexpress  the  measure  itself.  Specihcally,  this  measure 
may  simply  be  interpreted  as  one  of  independence  between  two  or  more  probability  density 
functions  (data  population)  and  is  therefore  naturally  applicable  to  the  well  known  problem 
of  independent  component  analysis  (ICA)  (  or  also  known  as  the  Blind  Source  Separation 
problem) . 

In  light  of  this  practical  interest  and  of  the  intrinsic  properties  of  the  a— JR  divergence 
measure,  we  propose  it  be  the  basis  for  a  new  criterion  as  further  elaborated  on  in  the  next 
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subsection.  In  spite  of  its  numerous  advantages,  our  proposed  technique  nevertheless  unveils 
a  limitation  in  its  computational  implementation,  where  a  bottleneck  emerges  in  the  course 
of  estimating  probability  densities. 

7.2.2  a— Jensen- Renyi  divergence  as  an  Independence  Measnre 

For  a  set  of  probability  distributions  {R(a;)}j=i...n,  a— JR  divergence  is  dehned  as 

n  n 

=  R“(R(a;)  (7.2.1) 

i=l  i=l 

with  =  1,  0  <  cUj  G  77  and  0  <  a  <  2.  The  a— JR  divergence  measures  the  distance 

between  two  or  more  distributions  by  adjusting  weights  on  different  distributions. 

By  Jensen’s  inequality,  one  can  show  that  the  above  expression  is  minimized  and  achieves 
a  0-value  if  the  distributions  are  identical,  and  is  maximized  when  they  are  all  different  (  i.e. 
each  distribution  function  is  a  Dirac  function  positioned  in  different  locations).  If  applied  to  a 
set  of  conditional  distributions  of  X,  R,  for  example,  a  minimization  of  the  a— JR  divergence 
would  be  tantamount  to  establishing  that  the  distributions  of  X  conditioned  on  R,  for  all  R, 
are  equal,  hence  implying  the  independence  of  X  and  R.  It  is  worth  noting  that  this  measure 
provides  an  additional  flexibility  of  choosing  a  and  cuj,  hence  affording  the  selectivity  among 
the  data  as  mentioned  earlier. 

Given  the  essence  of  an  ICA  problem,  the  htting  measure  to  adopt  is  that  of  independence, 
which  raises  a  natural  question  of  how  to  reinterpret  the  a— JR  divergence  to  elicit  such 
information  contained  in  observed  random  variables  from  different  populations.  By  dehning 
in  Eq.  (  7.2.1)  Pi{x)  =  P{X  =  x\Y  =  pi)  (i.e.,  as  a  conditional  probability),  we  can 
easily  conclude  that,  JR^'‘^'‘^^  '”'{{pi{x)}i=i...n),  denoted  by  JRa{X,Y),  yields  a  measure  of 
independence  between  X  and  R  when  it  is  minimized. 

7.2.3  Application  to  ICA 

To  proceed  with  the  description  of  the  source  separation  problem,  we  denote  the  observations 
Xi,  i  =  1  ■  ■  ■  n,  X  =  (Xi,  X2,  •  •  ■  ,  X„)  as  a  result  of  a  mixing  action  of  an  unknown  matrix 
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A  on  source  data  s  =  (Si,  S2,  ■  ■  ■  ,Sn),  expressed  as  a  sequence  of  independent  random 
variables  S'*,  i  =  1  ■■  -  n  of  0-mean  and  unit- variance,  and  more  explicitly  as  a  linear  model 

X  =  As. 

Our  goal  is  to  then  recover  s  from  merely  observing  x.  The  solution  to  this  problem  requires 
one  to  typically  hrst  proceed  to  whiten  the  data,  i.e.,  diagonalize  the  data  covariance  matrix 
so  that  In  =  W AA^W^ ,  where  ”r”  denotes  transposition  and  In  is  an  identity  matrix.  The 
subsequent  step  is  to  search  for  an  adapted  pairwise  rotation  of  axes  to  yield  independent 
data  along  these  directions,  and  effected  by 

j  0  j 

where  Xf,Xj  are  the  corresponding  random  variables  to  Xi,Xj  rotated  by  an  angle  6.  By 
iterating  this  processing  to  other  pairs,  all  of  the  independent  components  are  gleaned. 

To  illustrate  the  proposed  technique,  we  provide  two  mixture  cases:  the  hrst  shown  in 
Fig.  (7.2),  consists  of  two  acoustic  speech  signals,  and  the  second  shown  in  Fig.  (7.4) 
includes  signals  with  heavy  tail  distributions.  The  recovered  signals  in  both  cases  are  shown 
in  the  corresponding  hgures,  and  demonstrate  the  effectiveness  of  the  proposed  technique. 
In  Figs.  (7.3)  and  (7.5)  we  compare  and  display  the  potential  gains  of  using  JR  divergence 
over  mutual  information.  The  sharper  and  more  signihcant  nulls  of  the  JR  measure  suggest 
a  resilience  and  additional  robustness  in  the  presence  of  perturbations  such  as  estimation 
errors. 

It  is  important  to  note  that  the  choice  of  parameters  ”uJi”  and  ’’a”  in  this  example  have 
not  been  necessarily  optimized  (uniform  prior  chosen  somewhat  arbitrarily  and  ”a”  selected 
in  light  of  the  underlying  signals  (i.e.  1  <  a  <  2  for  super-gaussian  processes  and  0  <  a  <  1 
for  sub-gaussian  processes).  This  in  fact,  and  to  the  best  of  our  knowledge,  remains  an  open 
problem. 
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7.3  a— Renyi  mutual  divergence  as  a  New  Criterion  for  ICA 

7.3.1  Introduction  to  a— Renyi  divergence 

a— Renyi  divergence,  an  important  information  divergence  introduced  by  Renyi  [81,  82],  can 
also  be  broadly  viewed  as  a  distance  measure  between  two  probability  density  functions  in 
spite  of  its  non-symmetric  structure  ( it  may  be  made  symmetric  at  a  cost  of  a  more  complex 
form),  it  is  defined  as 

RDa{f,g)  =  (^J  rix)9^~‘"{x)dx^  ,  (7.3.1) 

where  0<q;<2,q;7^1  and  f{x),  g{x),  x  G  are  two  probability  density  functions  of  two 
random  variables  X  and  Y .  This  divergence  shares  the  same  maximization  and  minimization 
points  as  a— JR  divergence  and  KL-divergence,  namely,  the  minimization  point  is  reached 
when  f{x)  =  g{x),  the  maximization  point  is  reached  when  f{x),g{x)  are  totally  different  ( 
or,  one  of  the  densities  f{x),g{x)  should  be  a  Dirac  function  in  IZ^  ). 

The  dehnition  of  a— Renyi  divergence  is  fairly  general  in  that  some  other  divergences  are 
its  special  case  for  a  specihc  value  of  a.  We  can,  for  instance,  see  that  KL-divergence  is 
obtained  when  a  — >  1, 

UmRDa{f,g)  =  jf{x)lnj^dx,  (7.3.2) 

and  when  a  =  1/2,  the  so-called  Battacharya  distance  is  obtained 

RDi/2{f,g)  =  -2  In  (^J  f{x)g{x)dx^  .  (7.3.3) 

As  is  well  known,  KL-divergence  becomes  the  mutual  information  if  we  take  f{x)  as  a 
joint  probability  density  of  a  random  vector  x  =  (Xi,  •  ■  ■  ,X„),  where  Xj,  i  =  1,  •  •  ■  ,n.  is 
a  random  variable  with  IZ^  as  a  sample  space,  and  g{x)  as  the  factorial  probability  density 
in  the  form  of  g{x)  =  fi{x)  ■  ■  ■  fn{x),  where  fi{x),  i  =  1,  ■  ■  ■  ,n,  a;  G  IZ'^  is  the  pdf  of  each 
individual  random  variable  Xj,  i  =  When  the  two  probability  densities  are  used 

to  write  a  a—  Renyi  divergence,  it  is  called  a— Renyi  mutual  divergence,  and  is  denoted 
as  MDa{x)  or  MDa{Xi,  ■  ■  ■  ,Xn).  a— Renyi  mutual  divergence  measures  the  dependency 
among  a  group  of  random  variables  Xi,  •  ■  ■  ,  X„  by  the  distance  between  its  joint  pdf  and  its 
factorial  pdf. 
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Thus,  a— Renyi  mutual  divergence,  as  a  generalization  of  mutual  information,  can  serve 
as  an  independence  information  measure  with  its  value  depending  on  the  index  a  applied 
to  each  distribution.  In  the  following  subsection,  we  restrict  a— Renyi  mutual  divergence  to 
two  random  variables  {X,  Y)  and  prove  a  theorem  that  compares  the  degree  of  independence 
measurement  of  mutual  information  and  Renyi  mutual  divergence  between  the  two  random 
variables.  This  therefore  provides  a  theoretical  argument  for  one  to  apply  Renyi  mutual 
divergence  to  ICA. 

7.3.2  Comparison  between  a— Renyi  mutual  divergence  and  Mutual  Informa¬ 
tion 

In  order  to  illustrate  the  advantage  of  using  a— Renyi  mutual  divergence  (with  1  <  a  <  2) 
as  a  criterion  for  ICA,  we  prove  the  following  theorem. 

Theorem  9.  For  two  continuous  random  variables  X,  Y  with  joint  pdf  h{x,  y)  and  marginal 
distributions  f{x),  g{y),  we  have  the  following  ineguality  between  a— Renyi  mutual  divergence 
MDa{X,Y)  and  mutual  information  I{X,Y), 
case  1:  Given  0  <  a  <  1, 

MD^{X,Y)<I{X,Y) 

ease  2:  Given  1  <  a  <  2, 

MD^{X,Y)>I{X,Y) 

Proof:  For  the  joint  probability  density  h{x,y)  of  random  variables  X,Y  and  the  corre¬ 
sponding  marginal  pdfs  f{x),g{y)  ,  the  a— Renyi  mutual  divergence  may  be  written  as 

MD^{X,Y)  = 


In  /  h‘^{x,y)  {f{x)g{y)Y  dxdy 


a  —  1 
1 


a—1 


1 


a—1 


a  —  1 


\f{X)9{Y)J 


(7.3.4) 
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Since  the  function  p{x)  =  ln(a;)  is  strictly  concave,  according  to  Jensen-Inequality,  we 
have 

E{p(X)}<p(E{X}) 

from  which,  we  have  that,  for  two  random  variables  X,Y, 


Mm) 


a—1 


mm:) 


a—1 


(7.3.5) 


when  0  <  a  <  1,  we  have  a  —  1  <  0,  thus 


MD^{X,Y)  = 


< 


a  —  1 
1 


InE 


(  hiX,Y)  \ 
\f{X)g{Y)J 


a—1 


a—1 


-  am) 

=  HX,Y) 


(7.3.6) 


when  1  <  CK  <  2,  we  have  a  —  1  >  0,  thus 


a—1 


mdjxx}  - 


> 


a  —  1 
1 

q;  —  1 


a—1 


\f{X)g{Y)) 


=  '{'-(im) 

=  nx,Y) 


(7.3.7) 


Noted  that  '  ='  is  established  only  when  h{x,y)  =  f{x)g{y),  namely  X,Y  are  indepen¬ 
dent,  which  concludes  the  proof.  ■ 

We  here  give  two  examples  of  this  theorem  with  value  a  =  0.2  and  a  =  1.8,  Fig.  (7.6)  and 
Fig(7.7).  These  two  pictures  compare  the  theoretical  values  of  a— Renyi  mutual  divergence 
of  a  sequence  of  Gaussian  sources  with  that  of  mutual  information,  the  formula  of  the 
CK— Renyi  mutual  divergence  of  a  sequence  of  Gaussian  sources  is  calculated  in  Section  7.5, 
we  also  draw  the  approximations  of  ck— Renyi  mutual  divergence  and  mutual  information  in 
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Figure  7.6:  Approximated  0.2— Renyi  mutual  divergence  and  its  exact  theoretical  value  com¬ 
pared  to  mutual  information 


these  two  pictures,  again,  detail  knowledge  about  the  approximation  is  deferred  to  Section 

7.5. 


7.4  Application  to  ICA 


As  a  result  of  the  above  theorem,  we  can  see  that  when  1  <  a  <  2,  a— Renyi  mutual 
divergence  is  a  better  adapted  measure.  This  is  made  more  compelling  when  considering  the 
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Figure  7.7:  Approximated  1.8— Renyi  mutual  divergence  and  its  exact  theoretical  value  com¬ 
pared  to  mutual  information 


fact  that  a— Renyi  divergence  is  a  concave  function  in  each  distribution,  and  that  a— Renyi 
mutual  divergence  attains  a  0  minimum  when  two  random  variables  are  independent,  and  a 
maximal  77^“" (X)  when  the  two  random  variables  are  fully  dependent.  In  our  application 
of  such  a  measure,  we  propose  a  non-parametric  estimation  of  a— Renyi  mutual  divergence 
based  on  dependent  data  as  shown  in  section  7.5. 

The  target  function  we  use  for  ICA  is  the  following 

9  =  argmin(MD„(Xf,Xf)) 

The  technique  we  use  here  is  similar  to  that  of  a— JR  divergence.  To  further  clarify,  we 
demonstrate  the  separation  performance  of  the  a— Renyi  mutual  divergence  as  ICA  criterion, 
an  experiment  using  the  source  mixed  signal  and  an  a  =  1.6  for  a  mutual  divergence  are 
shown  in  Fig.(7.8),  as  well  as  a  comparison  of  1.6— Renyi  mutual  divergence  and  mutual 


CHAPTER  7.  NEW  MEASURE  CRITERIA  EOR  ICA 


96 


information  given  in  Fig. (7.9).  From  these  two  figures,  we  clearly  see  that  a— Renyi  mutual 
divergence  (1  <  a  <  2)  is  a  better  criterion  than  mutual  information,  and  holds  an  advantage 
over  a- JR  divergence  whose  efficient  estimation  presents  some  challenging  issues. 

7.5  Approximation  of  ct— Renyi  Mutual  Divergence 

7.5.1  Introduction 

In  this  Section,  we  investigate  the  estimation  of  Renyi  mutual  information  using  the  relative 
frequencies  calculated  on  cells  of  adaptive  partitions  of  oi  X  xY .  This  is  a  generalization 
of  a  non-parametric  estimation  of  mutual  information  proved  by  [86]  and  further  implemented 
by  [19]. 

As  described  in  Section  7.3.1,  Renyi  divergence  aims  at  measuring  the  distance  between 
two  probability  density  functions  and  is  given  by 

RDa{p,q)  =  — ^-In 

a  —  1 

with  0  <  a  <  2.  It  is  identically  0  if  the  random  variables  are  equal  in  distribution.  The 
vanishing  property  of  Renyi  divergence  is  also  equivalent,  and  as  noted  earlier,  to  indepen¬ 
dence  of  two  random  variables  when  their  joint  density  and  their  marginal  distributions  are 
invoked.  To  proceed  with  the  description  of  a  non-parametric  alternative  method  to  estimate 
a— Renyi  mutual  divergence  of  two  random  variables  in  this  case,  we  state  the  following  two 
theorems  in  the  following. 

7.5.2  Approximation  Theorems  of  a— Renyi  mutual  divergence 

We  consider  a  pair  of  random  variables  r]  taking  values  in  a  measurable  space  {X  x 
Y,Sx  X  Sy)  with  probability  distributions  P^{.)  and  Pr,{-)  respectively.  Assume  the  joint 
distribution  P^,n{-)  of  rj  is  absolutely  continuous  with  respect  to  the  product  distribution 
P^  X  P-qi-),  then  from  Radon-Nikodym  theorem,  there  exists  a  function  a^^q{x,y),  assuming 
finite  nonnegative  values  and  measurable  relative  to  the  a-algebra  Sx  x  Sy,  such  that  for 
all  B  E  Sx  X  Sy,  the  probability  P^^q{B)  is  given  by  the  integral  of  a^^q{x,y)  over  B  with 
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respect  to  the  measure  x  Pr/{.)  : 


P^^^{B)  =  [  a^^^{x,y)P^x  P^{dx,dy)  (7.5.2) 

Jb 

The  quantity  a^,n{x,  y)  is  called  the  density  of  the  measure  with  respect  to  the  measure 
P^  X  P,,(.)  and  is  conventionally  denoted  by 


dP^  X  P^(.)‘ 


(7.5.3) 


If  we  assume  probabilities  P^^,  P^,  P^  have  probability  densities,  namely,  dP^jj{.)  =  h{x,  y)dxdy, 
dP^{x)  =  f{x)dx,  dPrj{y)  =  g{y)dy,  from  the  dehnition  of  a— Renyi  mutual  divergence  of 
random  variables  X  and  Y  ,  we  have 


MD^{X,Y)  =  —^\n[  h°‘{x,y){f{x)g{y)f  ""  dxdy 

Oi  —  I  JxxY 

=  ^^In  [  a'^~^{x,y)P^^{dx,dy).  (7.5.4) 

O!  —  i-  JxxY 

We  consider  the  approximation  of  a— Renyi  mutual  divergence  based  on  a  sequence  of  hnite 
partition  =  {Cf^}  of  X  X  Y  with  the  property  that  is  a  hner  partition  than 
when  ki  >  k2.  A  hnite  partition  C  of  X  x  R  is  C  =  {Cj}j=i^...  such  that  ljr=i  Ci  =  X  x  Y , 

where  Q,  i  =  !,■■■  ,n  are  subsets  of  X  x  R  such  that  Ci^Cj  =  0,  i  ^  j.  Each  set 

Ci  is  called  a  cell  of  the  hnite  partition  C.  A  partition  is  a  rehnement  (  hner  or  nest 

partition  )  of  another  partition  C^‘^\  if  for  each  C^p  G  there  exists  a  G  such 

that  C  Cf.  C«,C(2)  are  called  nested  partitions.  An  example  of  nested  partitions  is 
shown  in  Fig.  (7.10). 


Theorem  7.5.1.  For  any  partition  C  =  (Ci,  •  •  •  ,  C„)  of  X  xY,  we  have  the  following 
conclusions  for  different  cases  of  a, 

Case.  1.  0  <  a  <  1, 


'  XxY 


a-^-\x,y)P^,{dx,dy)  =  inf  P“  (a)(P^  x  P,(Q))'-“  (7.5.5) 


Case.  2.  1  <  a  <  2, 


'XxY 


aln\^^y)Pivi.dx,dy)  =  sup^P|;(C'i)(P^  x  Pr,{,Ci)f- 


(7.5.6) 
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Figure  7.10:  Example:  Two  nested  partitions  of  7Z^ 

To  practically  implement  Theorem  (7.5.1)  ,  we  need  the  following  theorem  such  that  we 
can  use  the  nested  cell  technique  to  develop  a  algorithm  to  efficiently  approximate  a— Renyi 
mutual  divergence  by  considering  dependent  data. 

Theorem  7.5.2.  For  a  set  of  nested  partitions  of  X  xY,  we  have 
Case  1:  for  0  <  a  <  1, 

mf5;q“(C<)(-P«xa,(C.))'-“=  lim  E  q”(-4,xB,)(Pa.4,)xP,(B,))‘-“  (7.5.7) 

C  A:— >oo  ^  ' 

i  Ci=AiXBj&CW 

Case  2:  for  1  <  a  <  2, 

X  P,(Ci))'-”  =  lim  E  PiM  X  x 

r  A:— >oo  ^  ‘ 

i  Ci=AiXBjGC(k) 

(7.5.8) 

The  proof  details  for  Theorems  (7.5.1)  and  (7.5.2)  are  deferred  to  Appends  A.  and  B. 
Theorems  (7.5.1)  and  (7.5.2)  provide  us  with  an  approach  to  estimate  a-Renyi  divergence 
by  appropriately  dissecting  dependent  data  space  into  sufficiently  small  cells  that  in  the  limit 
achieve  the  ”inf”  of  the  product  given  in  Eq.  (7.5.8). 

To  illustrate  such  a  procedure,  we  generate  a  sequence  of  samples  from  a  pair  of  correlated 
Bi-normally  distributed  random  variables  (X,  Y),  with  the  following  joint  probability  density 
function 


CHAPTER  7.  NEW  MEASURE  CRITERIA  EOR  ICA 


101 


where  v  =  (x  —  y  —  Ty)  and 

al  ra^ay 
ra^ay 

and  where  r  is  the  correlation  coefficient  of  X  and  Y,  and  with  marginal  normal  densities 
f{x),f{y)  with  mean  values  y.x,Ty  and  variances 

To  this  case,  we  may  also  compute  the  theoretical  Renyi  divergence  and  show  it  to  be 


^wlog 
a  —  1 

2  —  a 


f{x,yT{f{x)f{y)y  ^dxdy] 


2a -2 

1 


log(l  —  r^) 

log((l  —  (1  —  Q;)r^)^  —  a^r^). 


(7.5.9) 


2a -2 

A  comparison  of  our  estimate  and  of  the  theoretical  value  are  shown  in  Fig.  (7.11),  and 
demonstrates  that  an  arbitrarily  accurate  estimate  is  achievable. 


7.6  Conclusion 

We  have  proposed  two  more  robust  information  measure  demonstrated  for  ICA,  a— JR  di¬ 
vergence  and  a— Renyi  mutual  divergence,  but  as  we  have  argued,  a— JR  divergence  has  a 
potentially  broader  scope  of  applicability  depending  on  one’s  ability  to  appropriately  inter¬ 
pret  the  probability  measures  and  the  related  parameters  in  the  JR  divergence,  this  problem 
can  be  overcome  by  using  Renyi  divergence  as  we  address  the  numerical  complexity  of  this 
measure  and  propose  a  non-parametric  alternative  implementation. 
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Compare  theoretical  alpha  Renyi  Divergence  with  non-parameter  approximation 


1  1 

^1^  approximation  vaiue 
-e-  theoretical  value 

1  23456789 

Figure  7.11:  Approximated  0.5— Renyi  mutual  divergence  and  its  exact  theoretical  value 


7.7  Appendix  A. 


We  only  prove  case  1  of  Theorem  (7.5.1),  which  is  an  immediate  result  of  the  following 
Lemma  (7.7.1)  and  Lemma  (7.7.2).  Case  2  may  be  similarly  proved. 

If  probabilities  P^^(a;,|/)  and  x  P^^  have  corresponding  probability  densities  p^r}{x,y) 
and  p^{x)pri{y)i  we  see  that 


Pd^)Pv{y) 


(7.7.1) 


and  therefore  Renyi  mutual  divergence  may  be  obtained  as 


/  Vlr,i.x,y){p^{x)pr^{y)y  ^dxdy  = - -In 

IxxY  a  —  i 


ai,vi^,yT  Pi,vidx,dy)  (7.7.2) 


>XxY 


we  have  the  following  two  inequalities  regarding  the  approximation  of  the  integral  in  the 
above  equation,  namely. 
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Lemma  7.7.1.  For  0  <  a  <  1,  and  any  partitions  of  X  x  Y, 

[  al-\x,y)P^,{dx,dy)  <  inf  (a)(P€  x  (7.7.3) 

JxxY  ^  j 

Proof:  Taking  a  function  g{x)  =  x",  which  is  concave  when  0  <  a  <  1  (see  Fig. (7. 12), 
hence  —g{x)  be  a  convex  function),  by  Jensen’s  inequality,  we  have,  for  every  probability 
distribution  F{x), 


Figure  7.12:  Functions  of  /  =  x",  with  0  <  a  <  1  and  1  <  a  <  2 
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and 


u^^dFeiu)  =  {,x,y)P^^{dx,dy) 


(7.7.7) 


Applying  Eq.  (7.7.4)  to  Eq.  (7.7.7)  shows  that,  for  V  i?  G  S'x  x  Sy  with  P^  x  Pri{B)  >  0, 
we  have 

/  oT~^{x,y)Pi:^{dx,dy) 

J  B 

1 

u°‘dFB{u) 

PUB) 


Pi  X  Pv{B)  Jo 
<  P^x  P^{B) 


P^  X  P^{B) 

\  1—a 


=  (q“(B))(Pj  X  (7.7.8) 

We  can  see  that  Eq.  (7.7.8)  is  also  true  when  P^  x  Prj{B)  =  0.  Now  we  consider  a  certain 
dissection  {Cj}  of  the  space  X  x  F.  From  Eq.  (7.7.8),  we  have 

n 

/(O,...  ,C„)  ^  ^(P5"(C,))(A  X  P,(C,))‘-“ 


i=l 


> 


Y]  /  Uv^^^^yUUdx,dy) 
7=1  da 


IXxY 


Uv  i^^y)PUdx,dy) 


(7.7.9) 


since  dissection  {Cj}  of  the  space  X  x  F  is  arbitrarily  chosen,  we  see  that  Eq.  (7.7.3)  is 
established. 


Lemma  7.7.2.  For  0  <  a  <  1,  and  any  partitions  of  X  xY , 

[  alf\x,y)P^,{dx,dy)  >  inf  V  (Q)(P^  x  P,(a))'-“  (7.7.10) 

JxxY  C 

Proof:  For  V  e  >  0,  since  x"  — >  0  as  a;  — >  0  and  — >  0  as  a;  — >  cx)  for  0  <  a  <  1,  we 

can  choose  a  constant  K  small  enough  so  that, 

0<Pf,{a^7'(a;,i/)<X}<|.  (7.7.11) 

Now  consider  a  set  {a'^~^{x,y)  >  K},  which  can  be  represented  in  the  form  of  a  sum  of 
nonintersecting  sets  Q  G  S'x  X  S'y,  i  =  1,  •  •  ■  ,  n  so  that  for  all  i 

hi=  inf  a^r,{x,y),  hi  =  sup  a^n{x,y) 

C,y)^Ci  {x,y)£Ci 


CHAPTER  7.  NEW  MEASURE  CRITERIA  EOR  ICA 


105 


we  have 


(kr-^  -  <  - 

From  Eq.  (7.7.6),  let  B  =  Q,  we  have  that,  for  V  i 

R  <  <  h^, 


X  Prj{Ci) 


and  yielding 


\a—  1 


< 


P^  X  Pj^{Ci) 
Further,  by  the  dehnition  of  h^,hi,  we  have 


Q.—  1 


<  (4i)““  ,  0  <  a  <  1. 


Pi^{Ci){hiY  <  /  {x,y)Pi:r,{dx,dy)  <P^n{Ci){hi) 


cx—l 


(7.7.12) 


(7.7.13) 


(7.7.14) 


(7.7.15) 


’Ci 


From  Eq.  (7.7.14)  and  Eq.  (7.7.15)  we  see  that 
F^“(a)(P«  X  P,(a))'-“  -  [  a^;;\x,y)P^,{dx,dy) 


>Ci 


<  [(h,)“-^-(h.)“-^]P^,(a),  (7.7.16) 


summing  the  inequalities  Eq.  (7.7.16)  over  i  and  using  Eq.  (7.7.12),  we  have 


/(C'l,--  -  ,C'„) 


/  {x,y)P^r,{dx,dy)  < -. 


(7.7.17) 


Let  Cn+i  =  {a'^^  ^  <  77},  we  see  that  the  system  of  sets  Ci,  •  ■  ■  ,  C^,  Cn+i  form  a  dissection 
of  the  space  X  x  Y.  From  Eq.  (7.7.17)  we  have  that 


/(C'l,  •  ■  ■  ,  C,,,  C,,+i)  =  /(Cl,  ■  ■  ■  ,  CO  +  (P0(C„+i)(P^  X  P0C„+i))i-“) 

<  [  +  +  d  (7- 7-18) 

Since  C^+i  =  <  K},  It  is  clear  that 

P“  (C,,+i)(P5  X  P0C,,+i))^-“ 

=  s  a'})(P5  X  p,({a";‘  < 

<  q”(K,-‘ <  A'})  <  ^  (T.7,19) 

Combining  this  inequality  with  inequality  (7.7.11),  we  see  that 

inf/(Ci,---  ,C„+i)  < /(Cl,---  ,C„+i)  <  [  a7~^{x,y)P^rjidx,dy)  +  e,  (7.7.20) 

/{a“-Cic} 

which  in  light  of  the  fact  that  e  is  arbitrary  and  K  may  be  chosen  small  enough,  we  can  see 
that  Eq.  (7.7.10)  is  established. 
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7.8  Appendix  B 

Similarly,  we  only  need  to  prove  Case  1  of  Theorem  7.5.2(as  case  2  is  similarly  proved),  where 
we  need  the  Lemmas  (3)  and  (4) 

Lemma  3.  For  0  <  xi,  X2,  yi,  y2  <  1,  we  have  the  following  inequality  for  0  <  a  <  1, 


xfy\  “  +  a;2i/2  '^  <  {xi  +  X2T{yi  +  y2f  “ 

Proof-.  Let  f{xi,X2,yi,y2)  =  xfyl~^  +  x^yl~^  -  (xi  +  a;2)"(|/i  +  1/2)^““ 
Take  partial  derivative  of  f{xi,X2,yi,y2)  with  respect  to  xi,  we  will  have 


fL{xi,X2,yi,y2) 


=  axf  ^yl  "  -  a(a;i  +  xa)"  (|/i  +  I/2) 


\  1  — Q: 


=  a 


yi 


ol—1 


Xi  +  X2 
1/1  +1/2 


cx—l 


where,  for  0  <  a  <  1,  and  1/1,  1/2  >  0 


f  = 

J  X^ 


/  n  -f  ^  +  ^2  ,  ^yi 

<0  it  —  >  - ,  namely  xi  >  — X2 

yi  yi  +  y2  y2 

^  xi  xi  +  X2  ,  yi 

>0  if  —  <  - ,  namely  Xi  <  — X2 

yi  yi  +  y2  y2 


This  leads  to  the  following  inequalities  for  V  1/1,  1/2  >  0  when  1  >  Xi  >  —X2 


y2 


f{^,x2,yi,y2)  <  f{xi,x2,yi,y2)  <  f{—x2,x2,yi,y2), 

y2 


and  when  0  <  Xi  <  —X2 


y2 


f{0,x2,yi,y2)  <  f{xi,x2,yi,y2)  <  f{—x2,x2,yi,y2)- 

y2 


From  these  two  inequalities,  we  have,  for  0  <  xi,  xa  <  1,  0  <  1/1,  1/2  <  1, 


(7.8.1) 


(7.8.2) 
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and  since 

f(—X2,X2,yi,y2) 

y2 

=  (“^2^  yl~'"  +  X2yl~'"  -  (j^X2  +  X2^  (|/l+|/2)^““ 

xUyi  +  y2)  X^{yi+y2)^,  , 

-  - a - 7 - [yi  +  y2 

y2  y2 

=  0  (7.8.3) 

(7.8.1)  is  immediately  obtained  for  yi,  y2  >  0,  which  is  clearly  true  when  yi  =  0  or  1/2  =  0. 
With  the  help  of  Lemma  (3),  we  have  the  following  Lemma  (4), 

Lemma  4.  For  any  two  nested  partitions  k  >  I  of  X  x  Y ,  we  have  for  0  <  a  <  1, 

P^%(A,  X  B,)(Pj(A.)  X  P„(B,)y- 

Ci=AiXBj£CW 

<  Y.  fS(A  X  X  (7.8.4) 

Ci=AiXBj£CW 

Proof-. 

We  know  that  for  any  cell  Ax  B  E  Ax  B  can  be  written  as 

M 

Ax  B  =  '^AmX  B^ 

m=l 

To  prove  Lemma  4,  we  only  need  to  show  that 

M 

P^^{A  X  B){P^{A)  X  P^(5))i-“  >  5^Pf,(x4™  X  B^){P^{A^)  x  P^(P^))'-“ 

m=l 

If  M  =  2 

2 

P“  (x4^  X  BJ{P^{AJ  X  P,{Bjy-^  (7.8.5) 

m=l 

<  [P“  (x4i  X  Pi)  +  P“  (x42  X  B2)][P^{A,)  X  P,(Pi)  +  Pg(x42)  X  P,(P2)]'-“  (7.8.6) 
In  the  case  that  x4  =  x4i  +  x42,  P  =  Pi  =  P2,  from  Lemma  (3),  we  have 
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[P^^{A,  X  5i)  +  (^2  X  B2WM1)  X  P,(Pi)  +  P^{A2)  X  P,(P2)]'““ 

<  P^“(AxP)[P^(A)xP,(P)]i-“  (7.8.7) 

This  inequality  can  be  repeat  and  proved  that  it  is  true  for  A  =  Ai  +  A2,  B  =  Bi  +  B2, 
thus  can  be  easily  generated  to  a  general  number  M,  thus  Lemma  (4)  follows. 


Chapter  8 

Possible  Future  Work 


In  this  chapter,  we  overview  briefly  the  contributions  of  this  dissertation,  we  also  present 
some  possible  further  development  to  extend  our  work. 

8.1  Summaries 

Our  first  contribution  in  this  work  is  the  construction  of  a  stochastic  framework  for  nonlinear 
diffusions  and  its  utilization  to  solve  a  long  standing  problem  of  an  evolution  stopping  time. 
While  the  previously  mention  contribution  provided  a  significant  gain  in  denoising  and  in 
segmentation,  it  also  need  improves  in  texture  preservation.  In  showing  a  direct  connection 
between  nonlinear  diffusion  and  wavelet  frame  analysis  Altering,  efficient  and  texture  pre¬ 
serving  nonlinear  techniques  were  developed. 

While  nonlinearities  introducing  in  process  were  aimed  at  accounting  for  the  various  depen¬ 
dencies  among  the  components  of  a  signal/image,  an  alternative  approach  would  be,  and  as 
developed  in  chapter  6  and  7,  to  use  the  underlying  PDF’s  to  find  and  separate  independent 
components(ICA) 
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8.2  Possible  Future  Research 

In  this  section,  we  list  several  potential  topics  related  to  this  thesis  that  might  constitute 
good  leads  for  future  research. 

Among  the  many  extensions  one  might  pursue  is  an  8-neighbor  transition  scenario  in  the 
Markov  chain.  Another  interesting  variation  on  the  theme  is  a  two-step  transition  random 
walk  which  may  also  ultimately  be  driven  to  a  continuous  space  setting. 

In  the  nonlinear  wavelet  frame  setting,  wavelets,  with  a  higher  order  of  vanishing  moments 
are  expected  to  yield  a  better  performance,  while  the  analytical  tractability  of  the  problem 
as  resolved  in  chapter  5  remains  an  open  problem. 

While  theoretical  not  limited  in  the  dimensionality  of  the  ICA  problem,  our  1-D  development 
is  in  real  need  to  be  extended  to  ultimately  address  two  and  higher  dimensions  for  applications 
ranging  from  denoising  to  feature  extraction  to  classihcation  and  recognition. 
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